Posted to common-dev@hadoop.apache.org by "Hairong Kuang (JIRA)" <ji...@apache.org> on 2008/11/20 01:25:44 UTC

[jira] Created: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

 Namenode in infinite loop for replicating/deleting corrupted block
-------------------------------------------------------------------

                 Key: HADOOP-4692
                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.18.0
            Reporter: Hairong Kuang
             Fix For: 0.20.0


Our cluster has an under-replicated block with only one replica; call its block id B. The NameNode log shows that the NameNode is stuck in an infinite loop of replicating and deleting the block.

INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask DN1 to replicate blk_B to datanode(s) DN2, DN3
WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_B reported from DN2  current size is 134217728 reported size is 134205440
WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_B from DN2
INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_B on DN2
INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.delete: blk_B is added to invalidSet of DN2
INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: DN2 is added to blk_B size 134217728
WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_B reported from DN3 current size is 134217728 reported size is 134205440
WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_B from DN3
INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_B on DN3
INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.delete: blk_B is added to invalidSet of DN3
INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: DN3 is added to blk_B size 134217728
INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask DN1 to replicate blk_B  to datanode(s) DN4, DN5
...
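
For reference, the two sizes in the log differ by a small amount. The following is only worked arithmetic on the numbers shown above; the class and method names are hypothetical and nothing here comes from the Hadoop code base.

```java
// Worked arithmetic on the two sizes shown in the log; names are hypothetical.
final class SizeGap {
    /** Number of bytes by which the reported replica is shorter than the recorded size. */
    static long shortfall(long recordedSize, long reportedSize) {
        return recordedSize - reportedSize;
    }

    public static void main(String[] args) {
        long recorded = 134217728L; // "current size" in the log
        long reported = 134205440L; // "reported size" in the log
        System.out.println("recorded size in MiB: " + recorded / (1024 * 1024)); // 128
        System.out.println("shortfall in bytes: " + shortfall(recorded, reported)); // 12288
    }
}
```

So each new replica arrives 12288 bytes short of the 128 MiB that the NameNode has on record, which is why every copy is flagged as inconsistent.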


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675925#action_12675925 ] 

Hudson commented on HADOOP-4692:
--------------------------------

Integrated in Hadoop-trunk #763 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/763/])
    

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: mismatchBlockReplication.patch, mismatchBlockReplication1.patch, namenode_inconsistent_size.patch, truncateBlockReplication.patch


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649227#action_12649227 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

Currently NameNode does not detect that a replica is truncated and therefore corrupted. One way to solve this is to let block report handling check each block's length and mark truncated blocks as corrupted. Also, when NN receives a new block that's truncated, NN should mark it as corrupted instead of adding it to recent invalidates directly. Once NameNode finds out that all replicas are corrupted, it will stop replicating/deleting the block.
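
A minimal sketch of the length check described above, in plain Java. The class name `BlockReportCheck`, the method `findTruncated`, and the map-based data structures are hypothetical stand-ins and are not the NameNode's actual code; the sketch only shows comparing each reported length against the recorded one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hadoop's FSNamesystem code.
final class BlockReportCheck {
    /**
     * Returns the ids of reported blocks whose length is shorter than the
     * length recorded for them. Blocks missing from the recorded map are skipped.
     */
    static List<Long> findTruncated(Map<Long, Long> recordedLengths,
                                    Map<Long, Long> reportedLengths) {
        List<Long> truncated = new ArrayList<>();
        for (Map.Entry<Long, Long> e : reportedLengths.entrySet()) {
            Long recorded = recordedLengths.get(e.getKey());
            if (recorded != null && e.getValue() < recorded) {
                truncated.add(e.getKey()); // candidate for "mark as corrupted"
            }
        }
        return truncated;
    }

    public static void main(String[] args) {
        Map<Long, Long> recorded = new HashMap<>();
        recorded.put(1L, 134217728L);
        Map<Long, Long> reported = new HashMap<>();
        reported.put(1L, 134205440L);
        System.out.println("truncated block ids: " + findTruncated(recorded, reported));
    }
}
```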





[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663956#action_12663956 ] 

dhruba borthakur commented on HADOOP-4692:
------------------------------------------

If the NN and the DN have the same generation stamp, then the file is either not open or is marked as "under construction" at the namenode.
So the NN will not start any new replication requests for these blocks (via HADOOP-5027).



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663161#action_12663161 ] 

Raghu Angadi commented on HADOOP-4692:
--------------------------------------

+1 for reporting a block as corrupt to NN.
Regarding the implementation:

The patch makes BlockSender report the corruption (implicitly assuming that a null client implies a transfer). I think this approach mixes higher-level policy with lower-level implementation.

My suggestion would be to make BlockSender throw an exception (it throws IOException now; it could throw TruncatedBlockException instead). Then make the block transfer thread (in DataNode.java) catch it and report the corrupt block to NN.
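
A rough sketch of the layering suggested above, under stated assumptions. Only the name `TruncatedBlockException` comes from the comment; `TransferSketch`, `CorruptBlockReporter`, `sendBlock`, `transfer`, and the length parameters are hypothetical and do not reflect the real BlockSender or DataNode signatures.

```java
import java.io.IOException;

// Hypothetical sketch of the suggested layering; not the actual DataNode code.
final class TransferSketch {
    /** Extends IOException, so callers that already catch IOException keep working. */
    static final class TruncatedBlockException extends IOException {
        TruncatedBlockException(String message) {
            super(message);
        }
    }

    /** Stand-in for whatever call reports a corrupt block to the NameNode. */
    interface CorruptBlockReporter {
        void reportCorrupt(long blockId);
    }

    /** Low-level sender: detects the truncation but makes no policy decision. */
    static void sendBlock(long blockId, long onDiskLength, long expectedLength)
            throws IOException {
        if (onDiskLength < expectedLength) {
            throw new TruncatedBlockException("block " + blockId + " has " + onDiskLength
                    + " bytes on disk, expected " + expectedLength);
        }
        // Streaming of the block bytes is omitted in this sketch.
    }

    /** Higher-level transfer: decides what a truncation means. Returns true if sent. */
    static boolean transfer(long blockId, long onDiskLength, long expectedLength,
                            CorruptBlockReporter reporter) {
        try {
            sendBlock(blockId, onDiskLength, expectedLength);
            return true;
        } catch (TruncatedBlockException e) {
            reporter.reportCorrupt(blockId); // the policy lives in the caller
            return false;
        } catch (IOException e) {
            return false; // other I/O failures are not reported as corruption
        }
    }

    public static void main(String[] args) {
        boolean sent = transfer(7L, 134205440L, 134217728L,
                id -> System.out.println("would report corrupt block " + id));
        System.out.println("sent: " + sent);
    }
}
```

The point of the split is that the sender only signals what it found, while the transfer thread chooses the reaction.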





[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674766#action_12674766 ] 

Raghu Angadi commented on HADOOP-4692:
--------------------------------------

+1. The patch looks good and makes sense to me.

Regarding correctness and how it fits into the larger context of related fixes like HADOOP-5133 and HADOOP-5027, I haven't looked into it much. That area of HDFS is under a lot of flux.

By the way, the 'links' for the jira say HADOOP-3314 is a duplicate of this; is that still true? Most likely HADOOP-3314 still needs to be fixed.



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657105#action_12657105 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

Another idea is to pass the NN-recorded block length when replicating a block (currently -1 is passed). When the sender sees that its on-disk length is less than the requested length, it reports the corrupt block to NN and stops replication.
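
The decision described above could look roughly like this. It is a sketch under assumptions: the class `ReplicationLengthCheck`, the enum `Outcome`, and the constant name are hypothetical, and treating -1 as "no expected length supplied" is an interpretation of the comment rather than confirmed behaviour.

```java
// Hypothetical sketch of the sender-side decision; not the actual patch.
final class ReplicationLengthCheck {
    /** Assumed meaning of the -1 mentioned above: no expected length was supplied. */
    static final long NO_EXPECTED_LENGTH = -1L;

    enum Outcome { SEND, REPORT_CORRUPT_AND_STOP }

    /** Compares the replica's on-disk length with the length requested by the NameNode. */
    static Outcome decide(long onDiskLength, long requestedLength) {
        if (requestedLength != NO_EXPECTED_LENGTH && onDiskLength < requestedLength) {
            return Outcome.REPORT_CORRUPT_AND_STOP;
        }
        return Outcome.SEND;
    }

    public static void main(String[] args) {
        System.out.println(decide(134205440L, 134217728L));
        System.out.println(decide(134205440L, NO_EXPECTED_LENGTH));
    }
}
```

With -1 passed, the sender has nothing to compare against, so a truncated replica is copied as-is; that matches the loop seen in the log.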



[jira] Updated: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4692:
----------------------------------

    Attachment: mismatchBlockReplication.patch

With this patch, this issue does not depend on HADOOP-5027 any more.



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667375#action_12667375 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]

Ant test-core had the following known failures:
    [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.35 sec
    [junit] Test org.apache.hadoop.http.TestGlobalFilter FAILED
    [junit] Running org.apache.hadoop.mapreduce.TestMapReduceLocal
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 29.481 sec



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663896#action_12663896 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

> Copying only the bytes requested by NN is ok (as far as NN is concerned).

I am still not sure of this. If the block is being written to, a longer block is also a corrupt block. If the block is being written to, then copying partial data is useless.

Hi Dhruba, could you please clarify if it is possible that ReplicationMonitor may replicate a block that's being written to after the introduction of sync & append?



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650454#action_12650454 ] 

Brian Bockelman commented on HADOOP-4692:
-----------------------------------------

This is a duplicate of HADOOP-3314, but this really writes up the problem better... can we close the older ticket?

We've run into this issue locally, and it's rather debilitating because it can result in "silent corruptions": these truncations can accumulate for a long time without anything noticing.  If you are running with 2 replicas (hey, not all of us can afford all that raw disk space...) and lose a data node, then this can result in a nasty surprise if the second copy had this truncation problem.

This in fact has caused corruption for about 500 files locally.



[jira] Updated: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4692:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I've just committed this.

Regarding Raghu's concern, this patch is based on the assumption that the length of a block (identified by its block id and generation stamp) recorded on the NN's side can only grow and never shrink. I will keep an eye on HADOOP-5133 and HADOOP-5027 to make sure they observe this assumption.
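
To make the assumption concrete, here is a minimal sketch of a record that accepts growth and rejects shrinkage for a given block id and generation stamp. The class and method names are hypothetical and are not taken from the committed patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not code from the HADOOP-4692 patch.
final class RecordedLengths {
    // Key: "blockId:generationStamp"; value: the length recorded on the NN side.
    private final Map<String, Long> lengths = new HashMap<>();

    /** Returns true if the report was accepted, false if it would shrink the recorded length. */
    boolean report(long blockId, long genStamp, long reportedLength) {
        String key = blockId + ":" + genStamp;
        Long recorded = lengths.get(key);
        if (recorded != null && reportedLength < recorded) {
            return false; // a shorter replica is suspect; the recorded length is kept
        }
        lengths.put(key, reportedLength);
        return true;
    }

    /** Returns the recorded length, or -1 if nothing has been recorded for this block. */
    long recordedLength(long blockId, long genStamp) {
        return lengths.getOrDefault(blockId + ":" + genStamp, -1L);
    }
}
```

With the sizes from the log in the issue description, a report of 134205440 for a block already recorded at 134217728 would be rejected rather than accepted as the new length.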

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: mismatchBlockReplication.patch, mismatchBlockReplication1.patch, namenode_inconsistent_size.patch, truncateBlockReplication.patch
>
>
> Our cluster has an under-replicated block with only one replica, assuming its block id is B. NameNode log shows that NameNode is in an infinite loop replicating/deleting the block.
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask DN1 to replicate blk_B to datanode(s) DN2, DN3
> WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_B reported from DN2  current size is 134217728 reported size is 134205440
> WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_B from DN2
> INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_B on DN2
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.delete: blk_B is added to invalidSet of DN2
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: DN2 is added to blk_B size 134217728
> WARN org.apache.hadoop.fs.FSNamesystem: Inconsistent size for block blk_-B reported from DN3 current size is 134217728 reported size is 134205440
> WARN org.apache.hadoop.fs.FSNamesystem: Deleting block blk_B from DN3
> INFO org.apache.hadoop.dfs.StateChange: DIR* NameSystem.invalidateBlock: blk_B on DN3
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.delete: blk_B is added to invalidSet of DN3
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: DN3 is added to blk_B size 134217728
> INFO org.apache.hadoop.dfs.StateChange: BLOCK* ask DN1 to replicate blk_B  to datanode(s) DN4, DN5
> ...


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664226#action_12664226 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-4692:
------------------------------------------------

> so, the NN will not start any new replication requests for these blocks (via HADOOP-5027) 

The NN won't start new requests, but what if there are requests that are already scheduled?


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666301#action_12666301 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

In the current trunk, the source datanode ignores the block length that the NN sent and uses the on-disk block length to transfer the block.

What I plan to do is this: when receiving a block replication request, the datanode first checks whether the block is under construction by looking at the ongoingCreates list. If it is, the datanode stops replicating the block. Otherwise, it checks whether the on-disk block length is the same as the block length sent by the NN. If it is not, the datanode reports the block to the NN as corrupt and stops replicating. Otherwise, it starts replicating the block.
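
The plan above can be sketched as a small decision function. This is only an illustration: the class, enum, and method names are hypothetical, and the under-construction flag stands in for the lookup in the ongoingCreates list.

```java
// Hypothetical sketch of the plan above; the names are illustrative, not Hadoop APIs.
final class ReplicationPlan {
    enum Action { SKIP_UNDER_CONSTRUCTION, REPORT_CORRUPT, REPLICATE }

    static Action decide(boolean underConstruction, long onDiskLength, long nnLength) {
        if (underConstruction) {
            // Block is still being written: do not replicate it.
            return Action.SKIP_UNDER_CONSTRUCTION;
        }
        if (onDiskLength != nnLength) {
            // Length mismatch: report the replica to the NN as corrupt, do not replicate.
            return Action.REPORT_CORRUPT;
        }
        return Action.REPLICATE;
    }

    public static void main(String[] args) {
        // Sizes taken from the log in the issue description.
        System.out.println(decide(false, 134205440L, 134217728L)); // prints REPORT_CORRUPT
    }
}
```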


[jira] Updated: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4692:
----------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664761#action_12664761 ] 

dhruba borthakur commented on HADOOP-4692:
------------------------------------------

My understanding is that the NN will send the block length (as recorded in NN metadata) to the source datanode of the replication request. The source datanode will verify that this length matches the length of the block file on disk. If it does not match, the source datanode will not replicate the block. Is my understanding correct?


[jira] Updated: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Bockelman updated HADOOP-4692:
------------------------------------

    Attachment: namenode_inconsistent_size.patch

Attached is my first whack at a patch, which builds on HADOOP-4865.


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663163#action_12663163 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

Thanks Raghu for the comment. Yes, I like your suggestion. But even with your approach, BlockSender needs to know whether the block read is for a block transfer, by checking the client name, before throwing TruncateBlockException. Would this be OK?

Another question is what BlockSender should do if the on-disk block length is longer than the NN-recorded length. Currently block replication only copies the number of bytes recorded by the NN. Is this a good idea?


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649226#action_12649226 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

The block file of blk_B on DN1 shows that the on-disk block size is 134205440. So the only replica of this block is truncated and therefore corrupted, but reading this block does not cause a ChecksumException.


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674929#action_12674929 ] 

Hadoop QA commented on HADOOP-4692:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12400441/mismatchBlockReplication1.patch
  against trunk revision 745705.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3880/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3880/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3880/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3880/console

This message is automatically generated.


[jira] Updated: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4692:
----------------------------------

    Attachment: truncateBlockReplication.patch

A patch is attached for review.


[jira] Updated: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang updated HADOOP-4692:
----------------------------------

    Attachment: mismatchBlockReplication1.patch

Compared to the last patch, this patch has two changes:
1. Remove the isUnderConstruction check, because isValid returns false if a block is still under construction.
2. If a block's on-disk size is bigger than the NN-recorded length, do not mark it as corrupt. Instead, copy the number of bytes that the NN asks for.

Change 2 is being cautious. While working on HADOOP-5133, I realized that in the current trunk a block's length may not be finalized even when a file is closed. So marking a block as corrupt using the NN-recorded length is too dangerous.
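
A minimal sketch of the length handling after change 2, assuming that a replica shorter than the NN-recorded length is still reported as corrupt, as in the earlier plan. The names are hypothetical and do not come from the patch.

```java
// Hypothetical sketch of the revised check; the names are illustrative, not Hadoop APIs.
final class RevisedReplicationCheck {
    static final long REPORT_CORRUPT = -1L;

    /** Returns the number of bytes to copy, or REPORT_CORRUPT for a truncated replica. */
    static long bytesToCopy(long onDiskLength, long nnLength) {
        if (onDiskLength < nnLength) {
            // Assumed: a replica shorter than the recorded length is still treated as corrupt.
            return REPORT_CORRUPT;
        }
        // Equal or longer: copy only the number of bytes that the NN asks for.
        return nnLength;
    }

    public static void main(String[] args) {
        // A replica that is 12288 bytes longer than the NN-recorded length.
        System.out.println(bytesToCopy(134230016L, 134217728L)); // prints 134217728
    }
}
```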


[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657093#action_12657093 ] 

Brian Bockelman commented on HADOOP-4692:
-----------------------------------------

Bah - the approach works to trigger verification, but the verification doesn't catch the fact that there's too little data (the metadata is computed for the truncated block). In fact, the block verifies just fine!
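
A minimal sketch of why a checksum-only check passes here, assuming per-chunk CRC32 checksums that were computed from the already truncated data. The class name, the chunk size, and both method names are made up for illustration; this is not the DataNode's actual verification code.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical illustration only; names and chunk size are assumptions.
final class TruncatedBlockCheck {
    static final int CHUNK = 512; // assumed number of data bytes per checksum

    // Compute one CRC32 per chunk of the given data.
    static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            int off = i * CHUNK;
            crc.update(data, off, Math.min(CHUNK, data.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Checksum-only verification: recompute the sums and compare with the stored ones.
    static boolean verify(byte[] data, long[] stored) {
        return Arrays.equals(checksums(data), stored);
    }

    // Verification that additionally compares the on-disk length with the expected length.
    static boolean verifyWithLength(byte[] data, long[] stored, long expectedLength) {
        return data.length == expectedLength && verify(data, stored);
    }
}
```

If the stored checksums were produced from the truncated bytes, verify() succeeds; only a comparison against an independently recorded length, as in verifyWithLength(), can notice the missing data.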

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: namenode_inconsistent_size.patch
>
>



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668140#action_12668140 ] 

Konstantin Shvachko commented on HADOOP-4692:
---------------------------------------------

[See related comment here|https://issues.apache.org/jira/browse/HADOOP-5027#action_12668136]

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: mismatchBlockReplication.patch, namenode_inconsistent_size.patch, truncateBlockReplication.patch
>
>



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664715#action_12664715 ] 

Hairong Kuang commented on HADOOP-4692:
---------------------------------------

OK, if HADOOP-5027 makes sure that blocks under construction are not added to blocksMap, I will treat on-disk blocks whose length is inconsistent with the NN-recorded length as corrupt, and datanodes will stop replicating them.
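
The rule proposed above could be sketched roughly as follows. This assumes the block is already finalized (the HADOOP-5027 precondition), and the class, enum, and method names are hypothetical, not taken from any attached patch.

```java
// Hypothetical sketch of the proposed rule; not code from the patches on this issue.
final class ReplicaLengthCheck {
    enum State { VALID, CORRUPT }

    // A finalized replica whose on-disk length differs from the length the
    // NameNode recorded is classified as corrupt.
    static State classify(long nnRecordedLength, long onDiskLength) {
        return onDiskLength == nnRecordedLength ? State.VALID : State.CORRUPT;
    }

    // Only valid replicas may be used as a replication source.
    static boolean mayReplicate(long nnRecordedLength, long onDiskLength) {
        return classify(nnRecordedLength, onDiskLength) == State.VALID;
    }
}
```

With the sizes from the log in the issue description (134217728 recorded, 134205440 on disk), mayReplicate() returns false, so the source would no longer be asked to copy the short replica again.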

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: namenode_inconsistent_size.patch, truncateBlockReplication.patch
>
>



[jira] Assigned: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hairong Kuang reassigned HADOOP-4692:
-------------------------------------

    Assignee: Hairong Kuang

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: namenode_inconsistent_size.patch
>
>



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Brian Bockelman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657073#action_12657073 ] 

Brian Bockelman commented on HADOOP-4692:
-----------------------------------------

I'll be able to work on an updated patch tomorrow -- for now, this approach appears to be working.

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: namenode_inconsistent_size.patch
>
>



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664293#action_12664293 ] 

dhruba borthakur commented on HADOOP-4692:
------------------------------------------

Previously scheduled replication requests will complete and the new destination datanode will send a blockReceived message to the NN. In the meantime, if the file has been opened for "append", then the generation stamp on the namenode should have been bumped. If the blockReceived arrives at the NN after the generation stamp has been bumped, then the blockReceived will not be able to find this block in the blocksMap.

I think this should not cause any issues. Is there any race condition that I might have missed?
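
A toy model of the ordering described above, assuming a lookup that only succeeds when the reported generation stamp equals the current one. The class and its methods are invented for illustration and do not mirror the real blocksMap or FSNamesystem code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model only; the real NameNode data structures are more involved.
final class GenerationStampModel {
    // blockId -> current generation stamp known to the (modelled) NameNode
    private final Map<Long, Long> currentStamp = new HashMap<>();

    void addBlock(long blockId, long generationStamp) {
        currentStamp.put(blockId, generationStamp);
    }

    // Opening the file for append bumps the generation stamp of the block.
    void openForAppend(long blockId) {
        currentStamp.computeIfPresent(blockId, (id, gs) -> gs + 1);
    }

    // Returns true only if a block with this id and this exact stamp is known.
    boolean blockReceived(long blockId, long reportedStamp) {
        Long gs = currentStamp.get(blockId);
        return gs != null && gs == reportedStamp;
    }
}
```

In this model a blockReceived carrying the old stamp is simply not matched once openForAppend() has run, which corresponds to the "will not be able to find this block" case in the comment.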

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: namenode_inconsistent_size.patch, truncateBlockReplication.patch
>
>



[jira] Commented: (HADOOP-4692) Namenode in infinite loop for replicating/deleting corrupted block

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663172#action_12663172 ] 

Raghu Angadi commented on HADOOP-4692:
--------------------------------------

> BlockSender needs to know if the block reading is for block transfer or not by checking the client name before throwing TruncateBlockException. Would this be OK?

I don't think so. Right now it always throws IOException. We just need to change the exception type so that higher levels can distinguish it.

> Another question is what should BlockSender do if the on-disk block length is longer than the NN recorded length? Currently block replication only copies the number of bytes recorded by NN. Is this a good idea?

Copying only the bytes requested by NN is OK (as far as NN is concerned). As with the previous comment, I don't think BlockSender should worry about it; some higher level in DataNode should. I am +0 on fixing the "extra data" issue. But if we want to, the DataTransfer thread could check for the right size before even creating a BlockSender.
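
A rough sketch of the two suggestions above: a dedicated IOException subclass that callers can tell apart, and a length check done before any sender is created. TruncateBlockException is the name used in the quoted question, but its shape here is assumed; TransferPrecheck and bytesToSend are hypothetical names.

```java
import java.io.IOException;

// Assumed shape: a subclass of IOException so existing handlers keep working,
// while higher levels can catch it specifically.
class TruncateBlockException extends IOException {
    TruncateBlockException(String message) {
        super(message);
    }
}

// Hypothetical pre-check a transfer thread could run before creating a sender.
final class TransferPrecheck {
    // Returns the number of bytes to copy, or throws if the replica is too short.
    static long bytesToSend(long nnRecordedLength, long onDiskLength) throws IOException {
        if (onDiskLength < nnRecordedLength) {
            throw new TruncateBlockException("on-disk length " + onDiskLength
                + " is shorter than recorded length " + nnRecordedLength);
        }
        // Extra on-disk data is tolerated here: only the recorded bytes are copied.
        return nnRecordedLength;
    }
}
```

A caller that wants to report the replica as corrupt can catch TruncateBlockException ahead of a generic IOException handler. Rejecting replicas that are longer than the recorded length would be a separate policy decision.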

>  Namenode in infinite loop for replicating/deleting corrupted block
> -------------------------------------------------------------------
>
>                 Key: HADOOP-4692
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4692
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.0
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.20.0
>
>         Attachments: namenode_inconsistent_size.patch, truncateBlockReplication.patch
>
>
