Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/05/14 23:19:55 UTC

[jira] Created: (HADOOP-3392) Corrupted blocks leading to job failures

Corrupted blocks leading to job failures
----------------------------------------

                 Key: HADOOP-3392
                 URL: https://issues.apache.org/jira/browse/HADOOP-3392
             Project: Hadoop Core
          Issue Type: Improvement
    Affects Versions: 0.16.0
            Reporter: Christian Kunz


On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), such that jobs were failing because no live blocks were available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to check whether under-replicated blocks are okay.

Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere on the GUI. And there should be an option to undo the corruption and recompute the checksums.

Question: Is it at all probable that two or more replicas of a block have checksum errors? If not, then we could reduce the checking to singly-replicated blocks.
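
For reference, this is roughly what such an fsck run looks like (a minimal sketch; the summary lines are illustrative of the report format, which varies between Hadoop versions):

  # Check the whole namespace (or any subtree) and print a block-level summary.
  hadoop fsck /

  # Illustrative tail of the report:
  #   ...
  #   Under-replicated blocks:   11
  #   ...
  #   Status: HEALTHY
  #
  # A singly-replicated block whose only replica fails its checksum still counts
  # as present, so the overall status can stay HEALTHY even though every read of
  # that block fails.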



[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597604#action_12597604 ] 

Christian Kunz commented on HADOOP-3392:
----------------------------------------

I found the original email from Dhruba:

in 0.17.0, according to HADOOP-2063, you can get blocks with crc errors with:
hadoop fs -get -ignoreCrc

i.e., you could salvage the file containing the block with the crc error by getting the file with -ignoreCrc, removing the file, and putting it back.

This means that any application reading the file may have to deal with corrupted data, but in many cases that is still better than losing the whole file.
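
For concreteness, the salvage cycle described above would look roughly like the following once 0.17.0 is available (a sketch only; the file path is a made-up example):

  # 1. Copy the file out of dfs, ignoring checksum failures (per HADOOP-2063).
  hadoop fs -get -ignoreCrc /user/foo/data/part-00042 /tmp/part-00042

  # 2. Remove the damaged file from dfs.
  hadoop fs -rm /user/foo/data/part-00042

  # 3. Write the local copy back; fresh blocks and checksums are created on upload.
  hadoop fs -put /tmp/part-00042 /user/foo/data/part-00042

As noted above, the byte range covered by the bad block may still contain garbage, so readers of the rewritten file have to tolerate corrupted data.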



[jira] Issue Comment Edited: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597604#action_12597604 ] 

ckunz edited comment on HADOOP-3392 at 5/16/08 1:37 PM:
-----------------------------------------------------------------

I found the original email from Dhruba:

in 0.17.0, according to HADOOP-2063, you can get blocks with crc errors with:
hadoop fs -get -ignoreCrc

i.e., you could salvage the file containing the block with the crc error by getting the file with -ignoreCrc, removing the file in dfs, and putting the local copy back into dfs.

This means that any application reading the file may have to deal with corrupted data, but in many cases that is still better than losing the whole file.

      was (Author: ckunz):
    I found the original email from Dhruba:

in 0.17.0, according to HADOOP-2063, you can get blocks with crc errors with:
hadoop fs -get -ignoreCrc

i.e., you could salvage the file containing the block with the crc error by getting the file with -ignoreCrc, removing the file, and putting it back.

This means that any application reading the file may have to deal with corrupted data, but in many cases that is still better than losing the whole file.
  


[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596973#action_12596973 ] 

lohit vijayarenu commented on HADOOP-3392:
------------------------------------------

HADOOP-2065 introduces a new field in the block to indicate whether it is corrupt. A block is considered corrupt only if all of its replicas are corrupt; otherwise the corrupt replicas are filtered out. In the case you described, all replicas (in this case, just one) were corrupt, so the block would be marked as corrupt.
HADOOP-3013 has already been opened to list such blocks via the fsck command. Now that HADOOP-2065 is committed, fsck should be able to identify such copies.
On a similar note, when the namenode issues a request to replicate a block and the replica turns out to be corrupt, that should be reported back to the namenode. This should be fixed by HADOOP-3035.



[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598782#action_12598782 ] 

dhruba borthakur commented on HADOOP-3392:
------------------------------------------

If a block is smaller than its intended size, then just recomputing the checksum for that block will not work, especially if the block is in the middle of the file. It would cause the file to have variable-size blocks and would break many other pieces of file-system functionality.

One way to fix a corrupted file is to retrieve the original file (with the -ignoreCrc option) and then copy it back to HDFS.



[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597250#action_12597250 ] 

lohit vijayarenu commented on HADOOP-3392:
------------------------------------------

>Is it reasonable to ask for a hadoop command-line option to salvage non-truncated blocks with checksum errors? Otherwise, one would have to copy the corrupted blocks to the local filesystem (I overheard that this is possible in 0.17, correct?) and put them back into dfs.

Could you expand on what exactly salvage should do here? I am not sure we would be able to get an individual block using any command, unless you find its locations and go to the datanode to retrieve the actual block file stored there.
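
For what it is worth, the closest one can get today is locating the block by hand (a rough sketch; the file path, block id, and datanode data directory are made-up examples, and the on-disk layout is version-dependent):

  # List the blocks of the file and the datanodes holding each replica.
  hadoop fsck /user/foo/data/part-00042 -files -blocks -locations

  # On the reported datanode, the replica is an ordinary file under the
  # directory configured by dfs.data.dir, named after the block id:
  ls /grid/0/dfs/data/current/blk_1234567890123456789*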



[jira] Updated: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3392:
------------------------------------

    Component/s: dfs



[jira] Issue Comment Edited: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596973#action_12596973 ] 

lohit edited comment on HADOOP-3392 at 5/14/08 4:12 PM:
-------------------------------------------------------------------

- HADOOP-2065 introduces a new field in the block to indicate whether it is corrupt. A block is considered corrupt only if all of its replicas are corrupt; otherwise its corrupt replicas are filtered out and only the good ones (which also include replicas not yet verified) are reported. If all replicas are found to be corrupt, we do not filter them out but rather mark the block as corrupt. In the case you described, all replicas (in this case, just one) were corrupt, so the block would be marked as corrupt.
- HADOOP-3013 has already been opened to list such blocks via the fsck command. Now that HADOOP-2065 is committed, fsck should be able to identify such copies.
- On a similar note, when the namenode issues a request to replicate a block and the replica turns out to be corrupt, that should be reported back to the namenode. This should be fixed by HADOOP-3035.

      was (Author: lohit):
    HADOOP-2065 introduces a new field in the block to indicate whether it is corrupt. A block is considered corrupt only if all of its replicas are corrupt; otherwise the corrupt replicas are filtered out. In the case you described, all replicas (in this case, just one) were corrupt, so the block would be marked as corrupt.
HADOOP-3013 has already been opened to list such blocks via the fsck command. Now that HADOOP-2065 is committed, fsck should be able to identify such copies.
On a similar note, when the namenode issues a request to replicate a block and the replica turns out to be corrupt, that should be reported back to the namenode. This should be fixed by HADOOP-3035.
  


[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596994#action_12596994 ] 

Christian Kunz commented on HADOOP-3392:
----------------------------------------

Okay, looks like everything is taken care of except salvaging.

Is it reasonable to ask for a hadoop command-line option to salvage non-truncated blocks with checksum errors? Otherwise, one would have to copy the corrupted blocks to the local filesystem (I overheard that this is possible in 0.17, correct?) and put them back into dfs.



[jira] Commented: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598777#action_12598777 ] 

lohit vijayarenu commented on HADOOP-3392:
------------------------------------------

Ok, so the above steps basically re-compute the checksums on all blocks of the file. Your suggestion about salvaging (re-computing the checksum, in this case) a truncated block sounds like a good option to have.
There could be two ways to solve this: recompute the checksums for the whole file, or do it just for the corrupt block. If we do the second, we might end up having blocks of size less than BLOCK_SIZE in the system.



[jira] Updated: (HADOOP-3392) Corrupted blocks leading to job failures

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-3392:
-----------------------------------

    Description: 
On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), such that jobs were failing because no live blocks were available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to check whether under-replicated blocks are okay.

Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere on the GUI. And for checksum errors, there should be an option to undo the corruption and recompute the checksums.

Question: Is it at all probable that two or more replicas of a block have checksum errors? If not, then we could reduce the checking to singly-replicated blocks.

  was:
On one of our clusters we ended up with 11 singly-replicated corrupted blocks (checksum errors), such that jobs were failing because no live blocks were available.

fsck reports the system as healthy, although it is not.

I argue that fsck should have an option to check whether under-replicated blocks are okay.

Even better, the namenode should automatically check under-replicated blocks with repeated replication failures for corruption and list them somewhere on the GUI. And there should be an option to undo the corruption and recompute the checksums.

Question: Is it at all probable that two or more replicas of a block have checksum errors? If not, then we could reduce the checking to singly-replicated blocks.

