Posted to common-dev@hadoop.apache.org by "Brian Bockelman (JIRA)" <ji...@apache.org> on 2009/01/10 18:45:59 UTC

[jira] Created: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses

addStoredBlock should take into account corrupted blocks when determining excesses
----------------------------------------------------------------------------------

                 Key: HADOOP-5012
                 URL: https://issues.apache.org/jira/browse/HADOOP-5012
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Brian Bockelman


I found another source of corruption on our cluster.

0) Three replicas of a block exist
1) One is recognized as corrupt (3 reps total)
2) Namenode decides to create a new replica.  Replication is done and addStoredBlock is called (4 reps total)
3) There are too many replicas, so processOverReplicatedBlock is called by addStoredBlock.
4) processOverReplicatedBlock decides to invalidate the newly created replica.  [Oddly enough, it picks the newly created one instead of the one in the corrupted replicas map!]
5) We are in the same state as (1) -- 3 replicas total, 1 of which is still bad.

I believe we can fix this easily -- change the numCurrentReplica variable to take into account the number of corrupt replicas.
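
To make the intent concrete, here is a minimal, self-contained sketch of that idea in Java. It is illustrative only: the class and method names (ExcessReplicaSketch, chooseExcessReplica) are made up for this example and are not the actual FSNamesystem code or the attached patch. It simply shows corrupt replicas being excluded from the count of good copies and preferred for invalidation, so a freshly replicated good copy is not the one dropped.

import java.util.*;

// Illustrative sketch only -- not the real NameNode code.
class ExcessReplicaSketch {

    // Decide which replica, if any, should be invalidated once a new replica
    // has been stored. 'replicas' holds every datanode with a copy of the
    // block; 'corrupt' holds the subset whose copy is known to be corrupt.
    static String chooseExcessReplica(Set<String> replicas,
                                      Set<String> corrupt,
                                      int targetReplication) {
        int live = replicas.size() - corrupt.size();   // good copies only
        // Prefer dropping a corrupt copy whenever there are more physical
        // copies than the target asks for.
        if (!corrupt.isEmpty() && replicas.size() > targetReplication) {
            return corrupt.iterator().next();
        }
        // Only when there are more *good* copies than needed should a good
        // one be considered excess.
        if (live > targetReplication) {
            for (String node : replicas) {
                if (!corrupt.contains(node)) {
                    return node;
                }
            }
        }
        return null;   // nothing to invalidate
    }

    public static void main(String[] args) {
        Set<String> replicas = new LinkedHashSet<>(
            Arrays.asList("dn1", "dn2", "dn3", "dn4"));
        Set<String> corrupt = Collections.singleton("dn3");  // dn3's copy is corrupt
        // 4 physical copies (3 good + 1 corrupt) against a target of 3:
        // the corrupt copy on dn3 is chosen, not the freshly added good one.
        System.out.println(chooseExcessReplica(replicas, corrupt, 3));  // prints dn3
    }
}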

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Bockelman resolved HADOOP-5012.
-------------------------------------

    Resolution: Duplicate

Duplicate of HADOOP-4742


[jira] Commented: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663621#action_12663621 ] 

Brian Bockelman commented on HADOOP-5012:
-----------------------------------------

Hey Hairong,

This is 0.19 + a few unrelated patches.

I have attached the relevant logs below.  I guess this patch is another way of solving HADOOP-4742, as you point out.  It effectively does the same thing, but HADOOP-4742 better identifies the actual source of the problem.

I'll close as a duplicate.

Brian

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:38:58,399 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Inconsistent size for block blk_-1361380929156165877_44096 reported
from 172.16.1.164:50010 current size is 67108864 reported size is 0

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:38:58,399 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Mark new replica blk_-1361380929156165877_44096 from
172.16.1.164:50010 as corrupt because its length is shorter than
existing ones

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:38:58,399 INFO org.apache.hadoop.hdfs.StateChange: BLOCK
NameSystem.addToCorruptReplicasMap: blk_-1361380929156165877 added as
corrupt on 172.16.1.164:50010 by /172.16.1.164

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:39:28,820 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
172.16.1.184:50010 to replicate blk_-1361380929156165877_44096 to
datanode(s) 172.16.1.90:50010

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:39:40,465 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 172.16.1.90:50010 is
added to blk_-1361380929156165877_44096 size 67108864

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:47,088 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
172.16.1.90:50010 to replicate blk_-1361380929156165877_44096 to
datanode(s) 172.16.1.158:50010

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 172.16.1.158:50010 is
added to blk_-1361380929156165877_44096 size 67108864

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.chooseExcessReplicates: (172.16.1.90:50010,
blk_-1361380929156165877_44096) is added to recentInvalidateSets

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.chooseExcessReplicates: (172.16.1.184:50010,
blk_-1361380929156165877_44096) is added to recentInvalidateSets

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.invalidateBlock: blk_-1361380929156165877_44096 on
172.16.1.164:50010

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.invalidateBlocks: blk_-1361380929156165877_44096 on
172.16.1.164:50010 is the only copy and was not deleted.

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.invalidateBlock: blk_-1361380929156165877_44096 on
172.16.1.165:50010

hadoop-root-namenode-hadoop-name.log.2009-01-05:2009-01-05
18:40:53,107 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.invalidateBlocks: blk_-1361380929156165877_44096 on
172.16.1.165:50010 is the only copy and was not deleted.


[jira] Updated: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Bockelman updated HADOOP-5012:
------------------------------------

    Attachment: hadoop-5012.patch

Patch for the problem as described.

I do have an example of this problem happening in a logfile if someone's interested.


[jira] Commented: (HADOOP-5012) addStoredBlock should take into account corrupted blocks when determining excesses

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663062#action_12663062 ] 

Hairong Kuang commented on HADOOP-5012:
---------------------------------------

Which version of Hadoop are you using? numCurrentReplica does exclude corrupt replicas. Is it possible that your problem is caused by HADOOP-4742?
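
For illustration, here is a rough sketch of what counting replicas while tracking corrupt ones separately can look like. The names (ReplicaCountSketch, count, corruptMap) are made up for this example and are not the actual NameNode code; it assumes a corrupt-replicas map keyed by block ID, in the spirit of the addToCorruptReplicasMap messages in the logs above.

import java.util.*;

// Illustrative only -- shows the live count excluding corrupt replicas.
class ReplicaCountSketch {
    final int live;      // replicas believed good
    final int corrupt;   // replicas recorded as corrupt

    ReplicaCountSketch(int live, int corrupt) {
        this.live = live;
        this.corrupt = corrupt;
    }

    // 'locations' is every datanode reported to hold the block;
    // 'corruptMap' maps a block ID to the datanodes whose copy is corrupt.
    static ReplicaCountSketch count(String blockId,
                                    Collection<String> locations,
                                    Map<String, Set<String>> corruptMap) {
        Set<String> bad = corruptMap.getOrDefault(blockId, Collections.<String>emptySet());
        int corruptCount = 0;
        for (String node : locations) {
            if (bad.contains(node)) {
                corruptCount++;
            }
        }
        return new ReplicaCountSketch(locations.size() - corruptCount, corruptCount);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> corruptMap = new HashMap<>();
        corruptMap.put("blk_-1361380929156165877",
                       Collections.singleton("172.16.1.164:50010"));
        ReplicaCountSketch c = count("blk_-1361380929156165877",
                Arrays.asList("172.16.1.164:50010", "172.16.1.165:50010",
                              "172.16.1.184:50010", "172.16.1.90:50010"),
                corruptMap);
        // live = 3, corrupt = 1: with the corrupt copy excluded, the block is
        // not over-replicated against a target of 3.
        System.out.println(c.live + " live, " + c.corrupt + " corrupt");
    }
}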
