Posted to common-dev@hadoop.apache.org by "Wang Xu (JIRA)" <ji...@apache.org> on 2008/11/29 08:50:44 UTC

[jira] Created: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Mistake delete replica in hadoop 0.18.1
---------------------------------------

                 Key: HADOOP-4742
                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.18.1
         Environment: CentOS 5.2, JDK 1.6, 
16 datanodes and 1 namenode, each with 8 GB of memory and a 4-core CPU, connected via Gigabit Ethernet
            Reporter: Wang Xu


We recently deployed a 0.18.1 cluster and ran some tests. We found that
if we corrupt a block, the namenode detects it and re-replicates it as soon as
a client reads that block. However, the namenode deletes a healthy replica
(the source of that replication) at the same time. I think this
issue may affect the entire 0.18 branch.

After some tracing, I found that FSNamesystem.addStoredBlock() checks
the number of replicas after adding the block to blocksMap:

    NumberReplicas num = countNodes(storedBlock);
    int numLiveReplicas = num.liveReplicas();
    int numCurrentReplica = numLiveReplicas
      + pendingReplications.getNumReplicas(block);

which means all live replicas plus all pending replications are
counted. However, FSNamesystem.blockReceived(), which calls
addStoredBlock(), decrements the pendingReplications count only after
addStoredBlock() has returned:

    //
    // Modify the blocks->datanode map and node's map.
    //
    addStoredBlock(block, node, delHintNode);
    pendingReplications.remove(block);

Hence, the newly created replica is counted twice (once as a live
replica and once as a pending replication), so the block appears
over-replicated and a healthy replica is marked as excess and
mistakenly deleted.

I think swapping the order of these two calls in blockReceived() solves
this issue:

--- FSNamesystem.java-orig      2008-11-28 13:34:40.000000000 +0800
+++ FSNamesystem.java   2008-11-28 13:54:12.000000000 +0800
@@ -3152,8 +3152,8 @@
    //
    // Modify the blocks->datanode map and node's map.
    //
-    addStoredBlock(block, node, delHintNode );
    pendingReplications.remove(block);
+    addStoredBlock(block, node, delHintNode );
  }

  long[] getStats() throws IOException {
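To make the double count concrete, here is a minimal, self-contained sketch of both call orderings. The class, method, and datanode names are hypothetical and this is a simplified model, not the actual Hadoop code: target replication is 3, two healthy replicas exist, and one re-replication (to dn3) is in flight when dn3 reports the new block.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the replica double-count (hypothetical names, not Hadoop code).
public class ReplicaCountSketch {
    static final int TARGET_REPLICATION = 3;

    // 0.18.1 ordering: addStoredBlock() counts replicas BEFORE the
    // pending-replication entry for this block is removed.
    static int buggyOrder() {
        Set<String> live = new HashSet<>(Set.of("dn1", "dn2")); // healthy replicas
        int pending = 1;                    // re-replication to dn3 still "pending"
        live.add("dn3");                    // addStoredBlock(): new replica arrives
        int count = live.size() + pending;  // 3 live + 1 pending = 4
        pending--;                          // pendingReplications.remove(): too late
        return count;                       // 4 > 3, so a healthy replica looks excess
    }

    // Patched ordering: remove the pending entry first, then count.
    static int fixedOrder() {
        Set<String> live = new HashSet<>(Set.of("dn1", "dn2"));
        int pending = 1;
        pending--;                          // pendingReplications.remove(block)
        live.add("dn3");                    // then addStoredBlock()
        return live.size() + pending;       // 3 live + 0 pending = target, no deletion
    }

    public static void main(String[] args) {
        System.out.println("buggy order counts " + buggyOrder()
            + " replicas (looks over-replicated)");
        System.out.println("fixed order counts " + fixedOrder()
            + " replicas (matches target " + TARGET_REPLICATION + ")");
    }
}
```

In the buggy ordering the same dn3 replica contributes to both the live count and the pending count, which matches the "current replicas 4 in which has 1 pendings" line in the log below.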

The following log shows the mistaken deletion, with additional logging
inserted by me:

2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: *DIR*
NameNode.reportBadBlocks
2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
corrupt on 192.168.33.51:50010 by /192.168.33.51
2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
datanode(s) 192.168.33.45:50010
2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
added to blk_3828935579548953768_1184 size 67108864
2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
pendings
2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
192.168.33.51:50010
2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
192.168.33.51:50010
2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
ask 192.168.33.44:50010 to delete  blk_3828935579548953768_1184
2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
ask 192.168.33.51:50010 to delete  blk_3828935579548953768_1184



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4742:
----------------------------------

    Attachment: blockReceived-br18.patch

This is the patch for branch 0.18.

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Wang Xu
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: blockReceived-br18.patch, blockReceived.patch, HADOOP-4742.diff
>



[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4742:
----------------------------------

    Attachment:     (was: blockReceived.patch)

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Wang Xu
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: blockReceived-br18.patch, blockReceived.patch, HADOOP-4742.diff
>



[jira] Commented: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654089#action_12654089 ] 

Hudson commented on HADOOP-4742:
--------------------------------

Integrated in Hadoop-trunk #680 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/])
    

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Wang Xu
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: blockReceived-br18.patch, blockReceived.patch, HADOOP-4742.diff
>



[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4742:
----------------------------------

    Attachment: blockReceived.patch

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Wang Xu
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: blockReceived-br18.patch, blockReceived.patch, HADOOP-4742.diff
>



[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4742:
----------------------------------

    Attachment: blockReceived.patch

Thanks Wang for your contribution. I redid the patch against the trunk.

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Wang Xu
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: blockReceived.patch, HADOOP-4742.diff
>



[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Nigel Daley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nigel Daley updated HADOOP-4742:
--------------------------------

         Priority: Blocker  (was: Major)
    Fix Version/s: 0.18.3
         Assignee: Hairong Kuang

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.3
>



[jira] Commented: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Wang Xu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653985#action_12653985 ] 

Wang Xu commented on HADOOP-4742:
---------------------------------

Thanks, Hairong! I learned a lot here about how Hadoop issues progress, and I think I will do more myself next time. :)

> Mistake delete replica in hadoop 0.18.1
> ---------------------------------------
>
>                 Key: HADOOP-4742
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4742
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.1
>         Environment: CentOS 5.2, JDK 1.6, 
> 16 Datanodes and 1 Namenodes, each has 8GB Memory and a 4-core CPU, connected by GigabyteEthernet
>            Reporter: Wang Xu
>            Assignee: Wang Xu
>            Priority: Blocker
>             Fix For: 0.18.3
>
>         Attachments: blockReceived-br18.patch, blockReceived.patch, HADOOP-4742.diff
>
>
> We recently deployed a 0.18.1 cluster and ran some tests. We found that
> if we corrupt a block, the namenode detects it and re-replicates it as soon as
> a client reads that block. However, the namenode deletes a healthy block
> (the source of the above replication) at the same time. (I think this
> issue may affect the whole 0.18 tree.)
> After some tracing, I found that FSNamesystem.addStoredBlock() checks
> the number of replicas after adding the block to blocksMap:
>  |   NumberReplicas num = countNodes(storedBlock);
>  |    int numLiveReplicas = num.liveReplicas();
>  |    int numCurrentReplica = numLiveReplicas
>  |      + pendingReplications.getNumReplicas(block);
> which means all live replicas plus pending replications are counted.
> But at the end of FSNamesystem.blockReceived(), which calls
> addStoredBlock(), addStoredBlock() runs first and the
> pendingReplications count is only decremented afterwards:
>  |    //
>  |    // Modify the blocks->datanode map and node's map.
>  |    //
>  |    addStoredBlock(block, node, delHintNode );
>  |    pendingReplications.remove(block);
> Hence, the newly arrived replica is counted twice (once as live, once as
> pending), the block is marked as excess, and a healthy replica is
> mistakenly deleted.
> I think swapping the two lines in blockReceived() may solve this
> issue:
> --- FSNamesystem.java-orig      2008-11-28 13:34:40.000000000 +0800
> +++ FSNamesystem.java   2008-11-28 13:54:12.000000000 +0800
> @@ -3152,8 +3152,8 @@
>     //
>     // Modify the blocks->datanode map and node's map.
>     //
> -    addStoredBlock(block, node, delHintNode );
>     pendingReplications.remove(block);
> +    addStoredBlock(block, node, delHintNode );
>   }
>   long[] getStats() throws IOException {
> The following are the logs of the mistaken deletion, with additional
> logging inserted by me.
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: *DIR*
> NameNode.reportBadBlocks
> 2008-11-28 11:22:08,866 INFO org.apache.hadoop.dfs.StateChange: BLOCK
> NameSystem.addToCorruptReplicasMap: blk_3828935579548953768 added as
> corrupt on 192.168.33.51:50010 by /192.168.33.51
> 2008-11-28 11:22:10,179 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.50:50010 to replicate blk_3828935579548953768_1184 to
> datanode(s) 192.168.33.45:50010
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 192.168.33.45:50010 is
> added to blk_3828935579548953768_1184 size 67108864
> 2008-11-28 11:22:12,629 INFO org.apache.hadoop.dfs.StateChange: Wang
> Xu* NameSystem.addStoredBlock: current replicas 4 in which has 1
> pendings
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: DIR*
> NameSystem.invalidateBlock: blk_3828935579548953768_1184 on
> 192.168.33.51:50010
> 2008-11-28 11:22:12,630 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.delete: blk_3828935579548953768 is added to invalidSet of
> 192.168.33.51:50010
> 2008-11-28 11:22:13,180 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.44:50010 to delete  blk_3828935579548953768_1184
> 2008-11-28 11:22:13,181 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> ask 192.168.33.51:50010 to delete  blk_3828935579548953768_1184
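[Editor's note] The double count described in the quoted report can be sketched as follows. This is an illustrative toy model, not actual Hadoop 0.18 code: the class, field, and node names are invented for the example; only the ordering of "count replicas" vs. "clear the pending entry" mirrors the real bug.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the replica accounting in FSNamesystem (names are
// illustrative, not the real 0.18 API). The target replication is 3.
public class ReplicaCountDemo {
    static final int REPLICATION = 3;

    Set<String> liveNodes = new HashSet<>(); // nodes known to hold the block
    int pendingReplications = 0;             // in-flight replication requests

    // Like addStoredBlock(): the block is added to the block map first,
    // then live replicas plus pending replications are counted.
    int currentReplicaCount(String newNode) {
        liveNodes.add(newNode);
        return liveNodes.size() + pendingReplications;
    }

    // Buggy 0.18.1 ordering: count before decrementing pending, so the
    // node that just reported the block is counted both as live and
    // as a pending replication.
    int blockReceivedBuggy(String node) {
        int count = currentReplicaCount(node);
        pendingReplications--;
        return count;
    }

    // Fixed ordering (the swap proposed above): clear the pending
    // entry first, then count.
    int blockReceivedFixed(String node) {
        pendingReplications--;
        return currentReplicaCount(node);
    }

    public static void main(String[] args) {
        // Two healthy replicas remain; one re-replication is in flight.
        ReplicaCountDemo buggy = new ReplicaCountDemo();
        buggy.liveNodes.add("dn50");
        buggy.liveNodes.add("dn44");
        buggy.pendingReplications = 1;
        // The new replica is reported: 3 live + 1 pending = 4 > REPLICATION,
        // matching the "current replicas 4 in which has 1 pendings" log line,
        // so the namenode picks a replica to invalidate.
        System.out.println("buggy count = " + buggy.blockReceivedBuggy("dn45"));

        ReplicaCountDemo fixed = new ReplicaCountDemo();
        fixed.liveNodes.add("dn50");
        fixed.liveNodes.add("dn44");
        fixed.pendingReplications = 1;
        // With the pending entry cleared first: 3 live + 0 pending = 3.
        System.out.println("fixed count = " + fixed.blockReceivedFixed("dn45"));
    }
}
```

Under this toy model, only the buggy ordering exceeds the target replication, which is what triggers the spurious invalidation seen in the logs.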



[jira] Commented: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653974#action_12653974 ] 

Hairong Kuang commented on HADOOP-4742:
---------------------------------------

ant test-core succeeded:
BUILD SUCCESSFUL
Total time: 115 minutes 14 seconds

ant test-patch result:
     [exec] -1 overall.

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.

     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.





[jira] Commented: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652102#action_12652102 ] 

Hairong Kuang commented on HADOOP-4742:
---------------------------------------

Yes, I think this is indeed a problem. The proposed solution should be able to fix the problem.




[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4742:
----------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I've just committed this. Thank you, Wang!




[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Wang Xu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang Xu updated HADOOP-4742:
----------------------------

    Status: Patch Available  (was: Open)




[jira] Assigned: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting reassigned HADOOP-4742:
------------------------------------

    Assignee: Wang Xu  (was: Hairong Kuang)




[jira] Updated: (HADOOP-4742) Mistake delete replica in hadoop 0.18.1

Posted by "Wang Xu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang Xu updated HADOOP-4742:
----------------------------

    Attachment: HADOOP-4742.diff

