Posted to common-dev@hadoop.apache.org by "Raghu Angadi (JIRA)" <ji...@apache.org> on 2007/10/02 03:00:56 UTC

[jira] Updated: (HADOOP-1955) Corrupted block replication retries for ever

     [ https://issues.apache.org/jira/browse/HADOOP-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-1955:
---------------------------------

    Attachment: HADOOP-1955.patch


When the NameNode is not heavily loaded, {{computeDatanodeWork()}} goes through the nodes in the same order each time. A side effect is that it asks the same node to replicate a given block every time it retries that block.

When {{computeDatanodeWork()}} runs through all the datanodes, this patch sets the start index for the next iteration to one after the datanode that was asked to replicate a block in the current iteration. This fixes the problem seen in this jira (assuming the untried replica was not corrupted). Initially I thought of starting at a random index, but on a large cluster it can take a very long time before the second node is tried, especially if the two source nodes are close to each other.
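
The rotating start index can be illustrated with a minimal Java sketch. This is not the code in HADOOP-1955.patch: the class {{RotatingScan}}, the method {{pickSource}}, and the string node names are hypothetical, and the real {{computeDatanodeWork()}} does considerably more than pick a single source. The sketch only assumes that the nodes are scanned in a fixed order and that the next scan begins one past the node chosen in the current scan.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the code in HADOOP-1955.patch. */
public class RotatingScan {
    private final List<String> nodes;   // datanodes in their fixed scan order
    private int startIndex = 0;         // where the next scan begins

    public RotatingScan(List<String> nodes) {
        this.nodes = nodes;
    }

    /**
     * Scans every node once, beginning at startIndex, and returns the first
     * node that holds a replica of the block. The next scan then begins one
     * past the returned node. Returns null if no node holds a replica.
     */
    public String pickSource(Set<String> replicaHolders) {
        int n = nodes.size();
        for (int i = 0; i < n; i++) {
            int idx = (startIndex + i) % n;
            String node = nodes.get(idx);
            if (replicaHolders.contains(node)) {
                startIndex = (idx + 1) % n;  // rotate past the node just asked
                return node;
            }
        }
        return null;
    }
}
```

With nodes dn1..dn4 and replicas on dn2 and dn4, successive calls return dn2, then dn4, then dn2 again, so a corrupted replica on one source no longer blocks the other source from being tried. If every remaining replica is corrupted, the scan simply keeps cycling through them.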

If all the remaining replicas are corrupted, the NameNode will keep on trying. That's OK; we would like users to report such cases.

> Corrupted block replication retries for ever
> --------------------------------------------
>
>                 Key: HADOOP-1955
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1955
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.1
>            Reporter: Koji Noguchi
>            Assignee: Raghu Angadi
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1955.patch
>
>
> When replicating a corrupted block, the receiving side rejects the block due to a checksum error. The Namenode keeps on retrying (with the same source datanode).
> Fsck shows those blocks as under-replicated.
> [Namenode log]
> {noformat} 
> 2007-09-27 02:00:05,273 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 99.2.99.111
> ...
> 2007-09-27 02:01:02,618 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.37:9999
> 2007-09-27 02:10:03,843 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:10:08,248 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.35:9999
> 2007-09-27 02:20:03,848 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_-5925066143536023890
> 2007-09-27 02:20:08,646 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 99.9.99.11:9999 to replicate blk_-5925066143536023890 to datanode(s) 99.9.99.19:9999
> (repeats)
> {noformat} 
> [Datanode(sender) 99.9.99.11 log]
> {noformat} 
> 2007-09-27 02:01:04,493 INFO org.apache.hadoop.dfs.DataNode: Starting thread to transfer block blk_-5925066143536023890 to [Lorg.apache.hadoop.dfs.DatanodeInfo;@e58187
> 2007-09-27 02:01:05,153 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5925066143536023890 to 74.6.128.37:50010 got java.net.SocketException: Connection reset
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>   at java.io.DataOutputStream.write(DataOutputStream.java:90)
>   at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
>   at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
>   at java.lang.Thread.run(Thread.java:619)
> (repeats)
> {noformat} 
> [Datanode(one of the receivers) 99.9.99.37 log]
> {noformat} 
> 2007-09-27 02:01:05,150 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Unexpected checksum mismatch while writing blk_-5925066143536023890 from /74.6.128.33:57605
>   at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:902)
>   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
>   at java.lang.Thread.run(Thread.java:619)
> {noformat} 
