Posted to common-dev@hadoop.apache.org by "Dick King (JIRA)" <ji...@apache.org> on 2006/11/16 20:23:37 UTC

[jira] Created: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
-------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-731
                 URL: http://issues.apache.org/jira/browse/HADOOP-731
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.7.2
            Reporter: Dick King


for a particular file [alas, the file no longer exists -- I had to progress]  

    $dfs -cp foo bar        

and

    $dfs -get foo local

failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.

When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.

Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
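
A minimal sketch of that behaviour, with hypothetical names (Replica, markCorrupt) standing in for the real dfs client classes:

    import java.io.IOException;
    import java.util.List;

    /** Illustrative only: read from whichever replica verifies, marking bad ones. */
    public class ReplicaFallbackRead {

        /** Stand-in for one stored copy of a block; not a real Hadoop type. */
        interface Replica {
            byte[] read() throws IOException;   // throws if the checksum does not match
            void markCorrupt();                 // flag the copy so it is re-replicated from a good one
        }

        static byte[] readAnyGoodCopy(List<Replica> replicas) throws IOException {
            IOException lastFailure = null;
            for (Replica r : replicas) {
                try {
                    return r.read();            // caller gets its bytes; the bad copy stays invisible
                } catch (IOException checksumError) {
                    r.markCorrupt();            // "mark the bad one as not existing"
                    lastFailure = checksumError;
                }
            }
            throw lastFailure != null ? lastFailure : new IOException("no replicas available");
        }
    }

Piecing together a good copy when every replica is partly bad would need the same fallback applied per checksum chunk rather than per whole block.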

-dk


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Wendy Chien (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wendy Chien updated HADOOP-731:
-------------------------------

    Status: Patch Available  (was: Reopened)

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Wendy Chien
>         Attachments: hadoop-731-7.patch
>
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Reopened: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye reopened HADOOP-731:
-------------------------------------

      Assignee: Wendy Chien  (was: Sameer Paranjpye)

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Wendy Chien
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-731:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.11.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Wendy!

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Wendy Chien
>             Fix For: 0.11.0
>
>         Attachments: hadoop-731-7.patch
>
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-731?page=comments#action_12451394 ] 
            
Hairong Kuang commented on HADOOP-731:
--------------------------------------

I feel that a patch to http://issues.apache.org/jira/browse/HADOOP-698 should also fix this problem.

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: http://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

RE: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
A retry needs to close the current data/crc streams, open new data/crc
streams, and seek back to the read position. That should be enough. Then we
need to add this to the description of HADOOP-855.
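
Schematically, with stand-in interfaces rather than the actual stream classes, the retry would look something like this:

    import java.io.IOException;

    /** Illustrative sketch of the close / re-open / seek retry described above. */
    class RetryAfterChecksumError {

        /** Stand-in for the paired data + crc streams; close() is assumed safe to call twice. */
        interface Streams {
            int read(byte[] buf) throws IOException;   // verifies the checksum, may throw
            void close() throws IOException;
        }

        /** Stand-in for the client that opens the streams at a position, optionally on another node. */
        interface Opener {
            Streams openAt(long pos, boolean avoidCurrentNode) throws IOException;
        }

        static int readWithOneRetry(Opener opener, long pos, byte[] buf) throws IOException {
            Streams streams = opener.openAt(pos, false);
            try {
                return streams.read(buf);                  // normal case: checksum verifies
            } catch (IOException checksumError) {
                streams.close();                           // 1. close current data/crc streams
                streams = opener.openAt(pos, true);        // 2. open new streams  3. seek back to pos
                return streams.read(buf);                  // retry against another replica
            } finally {
                streams.close();
            }
        }
    }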

Hairong

-----Original Message-----
From: Sameer Paranjpye [mailto:sameerp@yahoo-inc.com] 
Sent: Thursday, January 04, 2007 2:51 PM
To: hadoop-dev@lucene.apache.org
Subject: Re: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is
accessed and one copy has a checksum error the I/O command fails, even if
another copy is alright.

Shouldn't the fix for HADOOP-855 also include a retry on a different
replica? That was my understanding...

Hairong Kuang wrote:
> I feel that HADOOP-731 is not a duplicate of HADOOP-855. The proposal 
> to
> HADOOP-855 is to report to the namenode to delete the corrupted data 
> block/checksum block. The solution helps the next read get the correct 
> data, but the current read still throws a checksum error and thus 
> fails the cp/get operation that calls read.
> 
> Hairong
> 
> -----Original Message-----
> From: Sameer Paranjpye (JIRA) [mailto:jira@apache.org]
> Sent: Thursday, January 04, 2007 1:45 PM
> To: hadoop-dev@lucene.apache.org
> Subject: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is 
> accessed and one copy has a checksum error the I/O command fails, even 
> if another copy is alright.
> 
> 
>      [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Sameer Paranjpye resolved HADOOP-731.
> -------------------------------------
> 
>     Resolution: Duplicate
> 
> Duplicated in HADOOP-855
> 
>> Sometimes when a dfs file is accessed and one copy has a checksum 
>> error
> the I/O command fails, even if another copy is alright.
>> ---------------------------------------------------------------------
>> -
>> ---------------------------------------------------------
>>
>>                 Key: HADOOP-731
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>>             Project: Hadoop
>>          Issue Type: Bug
>>          Components: dfs
>>    Affects Versions: 0.7.2
>>            Reporter: Dick King
>>         Assigned To: Sameer Paranjpye
>>
>> for a particular file [alas, the file no longer exists -- I had to
> progress]
>>     $dfs -cp foo bar        
>> and
>>     $dfs -get foo local
>> failed on a checksum error.  The dfs browser's download function 
>> retrieved
> the file, so either that function doesn't check, or more likely the 
> download function got a different copy.
>> When a checksum fails on one copy of a file that is redundantly 
>> stored, I
> would prefer that dfs try a different copy, mark the bad one as not 
> existing [which should induce a fresh copy being made from one of the 
> good copies eventually], and make the call continue to work and deliver
bytes.
>> Ideally, if all copies have checksum errors but it's possible to 
>> piece
> together a good copy I would like that to be done.
>> -dk
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> https://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: 
> http://www.atlassian.com/software/jira
> 
>         
> 
> 



Re: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by Sameer Paranjpye <sa...@yahoo-inc.com>.
Shouldn't the fix for HADOOP-855 also include a retry on a different 
replica? That was my understanding...

Hairong Kuang wrote:
> I feel that HADOOP-731 is not a duplicate of HADOOP-855. The proposal to
> HADOOP-855 is to report to the namenode to delete the corrupted data
> block/checksum block. The solution helps the next read get the correct data,
> but the current read still throws a checksum error and thus fails the cp/get
> operation that calls read.
> 
> Hairong
> 
> -----Original Message-----
> From: Sameer Paranjpye (JIRA) [mailto:jira@apache.org] 
> Sent: Thursday, January 04, 2007 1:45 PM
> To: hadoop-dev@lucene.apache.org
> Subject: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is accessed
> and one copy has a checksum error the I/O command fails, even if another
> copy is alright.
> 
> 
>      [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Sameer Paranjpye resolved HADOOP-731.
> -------------------------------------
> 
>     Resolution: Duplicate
> 
> Duplicated in HADOOP-855
> 
>> Sometimes when a dfs file is accessed and one copy has a checksum error
> the I/O command fails, even if another copy is alright.
>> ----------------------------------------------------------------------
>> ---------------------------------------------------------
>>
>>                 Key: HADOOP-731
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>>             Project: Hadoop
>>          Issue Type: Bug
>>          Components: dfs
>>    Affects Versions: 0.7.2
>>            Reporter: Dick King
>>         Assigned To: Sameer Paranjpye
>>
>> for a particular file [alas, the file no longer exists -- I had to
> progress]  
>>     $dfs -cp foo bar        
>> and
>>     $dfs -get foo local
>> failed on a checksum error.  The dfs browser's download function retrieved
> the file, so either that function doesn't check, or more likely the download
> function got a different copy.
>> When a checksum fails on one copy of a file that is redundantly stored, I
> would prefer that dfs try a different copy, mark the bad one as not existing
> [which should induce a fresh copy being made from one of the good copies
> eventually], and make the call continue to work and deliver bytes.
>> Ideally, if all copies have checksum errors but it's possible to piece
> together a good copy I would like that to be done.
>> -dk
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> https://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
> 
>         
> 
> 


RE: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
I feel that HADOOP-731 is not a duplicate of HADOOP-855. The proposal to
HADOOP-855 is to report to the namenode to delete the corrupted data
block/checksum block. The solution helps the next read get the correct data,
but the current read still throws a checksum error and thus fails the cp/get
operation that calls read.
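
In other words (stand-in names, not the actual namenode or client API), reporting alone still fails the caller, while HADOOP-731 also needs the in-flight read to fall back to another replica:

    import java.io.IOException;

    /** Hypothetical contrast of the two proposals; types are stand-ins, not Hadoop classes. */
    class ReportVsRetry {
        interface Block {}
        interface NameNodeProxy { void reportBadBlock(Block b); }
        interface BlockReader {
            int read(byte[] buf) throws IOException;     // throws on checksum mismatch
            void switchReplica() throws IOException;     // re-open against a different datanode
        }

        /** HADOOP-855 as described: report the corrupt block, but the caller's read still fails. */
        static int reportOnly(BlockReader in, NameNodeProxy nn, Block blk, byte[] buf) throws IOException {
            try {
                return in.read(buf);
            } catch (IOException checksumError) {
                nn.reportBadBlock(blk);      // helps the next read get good data
                throw checksumError;         // but this cp/get still fails
            }
        }

        /** What HADOOP-731 asks for: report, then keep the current read going on another replica. */
        static int reportAndRetry(BlockReader in, NameNodeProxy nn, Block blk, byte[] buf) throws IOException {
            try {
                return in.read(buf);
            } catch (IOException checksumError) {
                nn.reportBadBlock(blk);      // same reporting as above
                in.switchReplica();          // move to a different datanode
                return in.read(buf);         // the caller still gets its bytes
            }
        }
    }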

Hairong

-----Original Message-----
From: Sameer Paranjpye (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, January 04, 2007 1:45 PM
To: hadoop-dev@lucene.apache.org
Subject: [jira] Resolved: (HADOOP-731) Sometimes when a dfs file is accessed
and one copy has a checksum error the I/O command fails, even if another
copy is alright.


     [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye resolved HADOOP-731.
-------------------------------------

    Resolution: Duplicate

Duplicated in HADOOP-855

> Sometimes when a dfs file is accessed and one copy has a checksum error
the I/O command fails, even if another copy is alright.
> ----------------------------------------------------------------------
> ---------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Sameer Paranjpye
>
> for a particular file [alas, the file no longer exists -- I had to
progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved
the file, so either that function doesn't check, or more likely the download
function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I
would prefer that dfs try a different copy, mark the bad one as not existing
[which should induce a fresh copy being made from one of the good copies
eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece
together a good copy I would like that to be done.
> -dk

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        



[jira] Resolved: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye resolved HADOOP-731.
-------------------------------------

    Resolution: Duplicate

Duplicated in HADOOP-855

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Sameer Paranjpye
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Wendy Chien (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wendy Chien updated HADOOP-731:
-------------------------------

    Attachment: hadoop-731-7.patch

Attached a patch which allows us to continue reading after a checksum error by modifying Checker.read to catch the ChecksumException thrown by verifySum.

In Checker.read, if we get a ChecksumException, we seek to a new datanode for both the data stream and the checksum stream (this only has an effect under dfs; it is a no-op for other filesystems). If at least one of the datanodes is different from before, we retry the read.

In DFSInputStream, I added a new seek method that requests a datanode other than the current node.
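
The control flow described above, sketched with stand-in names (the actual changes are in hadoop-731-7.patch):

    import java.io.IOException;

    /** Rough sketch of the retry in Checker.read; names are stand-ins, not the patched classes. */
    class CheckerReadSketch {

        /** Stand-in for a checksummed input stream. */
        interface CheckedStream {
            int readAndVerify(byte[] buf, int off, int len) throws IOException;  // throws on checksum mismatch
            /** Re-open at the same position on a different datanode; false if none is available
             *  (and always false / a no-op for filesystems other than dfs). */
            boolean seekToNewSource(long pos) throws IOException;
            long getPos() throws IOException;
        }

        static int read(CheckedStream data, CheckedStream sums, byte[] buf, int off, int len)
                throws IOException {
            long dataPos = data.getPos();
            long sumsPos = sums.getPos();
            try {
                return data.readAndVerify(buf, off, len);
            } catch (IOException checksumError) {
                // Ask both the data stream and the checksum stream for a different datanode.
                boolean newData = data.seekToNewSource(dataPos);
                boolean newSums = sums.seekToNewSource(sumsPos);
                if (!newData && !newSums) {
                    throw checksumError;                      // nowhere else to read from
                }
                return data.readAndVerify(buf, off, len);     // at least one node changed: retry once
            }
        }
    }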

 

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Wendy Chien
>         Attachments: hadoop-731-7.patch
>
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-731) Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467157 ] 

Hadoop QA commented on HADOOP-731:
----------------------------------

+1, because http://issues.apache.org/jira/secure/attachment/12349420/hadoop-731-7.patch applied and successfully tested against trunk revision r499156.

> Sometimes when a dfs file is accessed and one copy has a checksum error the I/O command fails, even if another copy is alright.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-731
>                 URL: https://issues.apache.org/jira/browse/HADOOP-731
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.7.2
>            Reporter: Dick King
>         Assigned To: Wendy Chien
>         Attachments: hadoop-731-7.patch
>
>
> for a particular file [alas, the file no longer exists -- I had to progress]  
>     $dfs -cp foo bar        
> and
>     $dfs -get foo local
> failed on a checksum error.  The dfs browser's download function retrieved the file, so either that function doesn't check, or more likely the download function got a different copy.
> When a checksum fails on one copy of a file that is redundantly stored, I would prefer that dfs try a different copy, mark the bad one as not existing [which should induce a fresh copy being made from one of the good copies eventually], and make the call continue to work and deliver bytes.
> Ideally, if all copies have checksum errors but it's possible to piece together a good copy I would like that to be done.
> -dk

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.