Posted to common-dev@hadoop.apache.org by "Brian Bockelman (JIRA)" <ji...@apache.org> on 2009/01/10 18:45:59 UTC
[jira] Created: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses
addStoredBlock should take into account corrupted blocks when determining excesses
----------------------------------------------------------------------------------
Key: HADOOP-5012
URL: https://issues.apache.org/jira/browse/HADOOP-5012
Project: Hadoop Core
Issue Type: Bug
Reporter: Brian Bockelman
I found another source of corruption on our cluster.
0) Three replicas of a block exist
1) One is recognized as corrupt (3 reps total)
2) The namenode decides to create a new replica. Replication completes and addStoredBlock is called (4 reps total)
3) There are now too many replicas, so addStoredBlock calls processOverReplicatedBlock.
4) processOverReplicatedBlock decides to invalidate the newly created replica. [Oddly enough, it invalidates the newly created one instead of the one in the corrupted replicas map!]
5) We are back in the same state as (1) -- 3 replicas total, 1 of which is still bad.
I believe we can fix this easily -- change the numCurrentReplica variable to take the number of corrupt replicas into account.
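The counting change proposed above can be sketched as follows. This is a standalone illustrative model of the idea, not the actual FSNamesystem code; the class and method names (ReplicaCount, isOverReplicated) are hypothetical.

```java
// Illustrative model of the proposed fix: count only usable replicas
// before deciding a block is over-replicated. Names here are
// hypothetical, not the real FSNamesystem API.
public class ReplicaCount {
    // Total replicas reported, including any known-corrupt ones.
    private final int totalReplicas;
    // Replicas recorded in the corrupt-replicas map.
    private final int corruptReplicas;

    public ReplicaCount(int totalReplicas, int corruptReplicas) {
        this.totalReplicas = totalReplicas;
        this.corruptReplicas = corruptReplicas;
    }

    // numCurrentReplica should reflect good replicas only, so the
    // corrupt ones are subtracted from the reported total.
    public int numCurrentReplica() {
        return totalReplicas - corruptReplicas;
    }

    // Judging excess against good replicas means the freshly
    // replicated good copy is no longer seen as the excess one.
    public boolean isOverReplicated(int targetReplication) {
        return numCurrentReplica() > targetReplication;
    }

    public static void main(String[] args) {
        // The scenario from the report: 4 replicas total, 1 corrupt,
        // target replication 3. The block is not over-replicated, so
        // the newly created good replica must not be invalidated.
        ReplicaCount rc = new ReplicaCount(4, 1);
        System.out.println(rc.numCurrentReplica()); // 3
        System.out.println(rc.isOverReplicated(3)); // false
    }
}
```

With the buggy counting (4 replicas, ignoring the corrupt one), the same scenario would report over-replication and trigger the invalidation loop described in steps 2-5.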
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses
Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Bockelman resolved HADOOP-5012.
-------------------------------------
Resolution: Duplicate
Duplicate of HADOOP-4742
[jira] Commented: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses
Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663621#action_12663621 ]
Brian Bockelman commented on HADOOP-5012:
-----------------------------------------
Hey Hairong,
This is 0.19 + a few unrelated patches.
I have attached the relevant logs below. I guess this patch is another way of solving HADOOP-4742, as you point out. It does effectively the same thing, but HADOOP-4742 identifies the root cause more precisely.
I'll close as a duplicate.
Brian
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:38:58,399 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Inconsistent size for block blk_-1361380929156165877_44096 reported from 172.16.1.164:50010 current size is 67108864 reported size is 0
18:38:58,399 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Mark new replica blk_-1361380929156165877_44096 from 172.16.1.164:50010 as corrupt because its length is shorter than existing ones
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:38:58,399 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_-1361380929156165877 added as corrupt on 172.16.1.164:50010 by /172.16.1.164
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:39:28,820 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 172.16.1.184:50010 to replicate blk_-1361380929156165877_44096 to datanode(s) 172.16.1.90:50010
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:39:40,465 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.90:50010 is added to blk_-1361380929156165877_44096 size 67108864
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:47,088 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 172.16.1.90:50010 to replicate blk_-1361380929156165877_44096 to datanode(s) 172.16.1.158:50010
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 172.16.1.158:50010 is added to blk_-1361380929156165877_44096 size 67108864
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (172.16.1.90:50010, blk_-1361380929156165877_44096) is added to recentInvalidateSets
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (172.16.1.184:50010, blk_-1361380929156165877_44096) is added to recentInvalidateSets
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-1361380929156165877_44096 on 172.16.1.164:50010
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.invalidateBlocks: blk_-1361380929156165877_44096 on 172.16.1.164:50010 is the only copy and was not deleted.
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.invalidateBlock: blk_-1361380929156165877_44096 on 172.16.1.165:50010
hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05 18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.invalidateBlocks: blk_-1361380929156165877_44096 on 172.16.1.165:50010 is the only copy and was not deleted.
[jira] Updated: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses
Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Bockelman updated HADOOP-5012:
------------------------------------
Attachment: hadoop-5012.patch
Patch for the problem as described.
I have an example of this problem happening in a logfile if anyone is interested.
[jira] Commented: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses
Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663062#action_12663062 ]
Hairong Kuang commented on HADOOP-5012:
---------------------------------------
Which version of Hadoop are you using? numCurrentReplica does exclude corrupt replicas. Is it possible that your problem is caused by HADOOP-4742?