Posted to common-dev@hadoop.apache.org by "lohit vijayarenu (JIRA)" <ji...@apache.org> on 2008/05/15 00:57:55 UTC

[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596973#action_12596973 ] 

lohit vijayarenu commented on HADOOP-3392:
------------------------------------------

HADOOP-2065 introduces a new field on the block that indicates whether it is corrupt. A block is considered corrupt only if all of its replicas are corrupt; otherwise, the corrupt replicas are filtered out. In the case you described, all replicas (here, just one) were corrupt, so the block would be marked as corrupt.
HADOOP-3013 has already been opened to list such blocks via the fsck command. Now that HADOOP-2065 is committed, fsck should be able to identify such copies.
On a similar note, when the namenode issues a request to replicate a block and the block is corrupt, this should be reported to the namenode. This should be fixed in HADOOP-3035.
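
To make the rule above concrete, here is a minimal sketch in Python (not the actual Hadoop code, which is Java). The names Replica, LocatedBlock and locate_block are hypothetical and chosen only for illustration. The sketch also assumes that a fully corrupt block still reports its corrupt locations, and that a block with no replicas at all counts as missing rather than corrupt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    datanode: str   # hypothetical identifier of the datanode holding the copy
    corrupt: bool   # True if this copy failed checksum verification


@dataclass
class LocatedBlock:
    block_id: int
    locations: List[str]  # datanodes reported for this block
    corrupt: bool         # the per-block "is corrupt" flag


def locate_block(block_id: int, replicas: List[Replica]) -> LocatedBlock:
    """Apply the rule: a block is corrupt only if every replica is corrupt."""
    healthy = [r.datanode for r in replicas if not r.corrupt]
    if healthy:
        # At least one good copy exists: filter out the corrupt replicas.
        return LocatedBlock(block_id, healthy, corrupt=False)
    # No good copy. Assumption: an empty replica list means "missing",
    # so the corrupt flag is set only when there was at least one replica.
    all_locations = [r.datanode for r in replicas]
    return LocatedBlock(block_id, all_locations, corrupt=len(replicas) > 0)


if __name__ == "__main__":
    # The singly-replicated case from this issue: the only replica is corrupt.
    print(locate_block(1, [Replica("dn1", True)]))
    # One corrupt and one healthy replica: only the healthy one is reported.
    print(locate_block(2, [Replica("dn1", True), Replica("dn2", False)]))
```

With a single corrupt replica the block itself is flagged corrupt, which is the situation fsck would need to report.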

> Corrupted blocks leading to job failures
> ----------------------------------------
>
>                 Key: HADOOP-3392
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3392
>             Project: Hadoop Core
>          Issue Type: Improvement
>    Affects Versions: 0.16.0
>            Reporter: Christian Kunz
>
> On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), so jobs were failing because no live blocks were available.
> fsck reports the system as healthy, although it is not.
> I argue that fsck should have an option to check whether under-replicated blocks are okay.
> Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere on the GUI. And for checksum errors, there should be an option to undo the corruption and recompute the checksums.
> Question: Is it at all probable that two or more replicas of a block have checksum errors? If not, then we could reduce the checking to singly-replicated blocks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.