Posted to common-dev@hadoop.apache.org by Eric Baldeschwieler <er...@yahoo-inc.com> on 2006/03/30 06:42:24 UTC

Re: Help: -copyFromLocal

Interesting.  It would actually be nice to include the CRCs in an  
export, so that you can validate your data when you reload it.  CRCs  
are best if they are kept end to end.

On Mar 29, 2006, at 9:59 AM, Doug Cutting wrote:

> monu.ogbe@richmondinformatics.com wrote:
>> However I'm getting the following error:
>> copyFromLocal: Target /user/root/crawl/crawldb/current/ 
>> part-00000/.data.crc
>> already exists
>
> Please file a bug report.  The problem is that when copyFromLocal  
> enumerates local files it should exclude .crc files, but it does  
> not. This is the listFiles() call on DistributedFileSystem:160.  It  
> should filter this, excluding files that are  
> FileSystem.isChecksumFile().
>
> BTW, as a workaround, it is safe to first remove all of the .crc  
> files, but your files will no longer be checksummed as they are  
> read.  On systems without ECC memory file corruption is not  
> uncommon, but I have seen very little on clusters that have ECC.
>
> Doug
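
A rough sketch of the workaround Doug describes, spelled out in Java 
rather than as a one-line find command: walk the local export and delete 
the ".NAME.crc" checksum files before running copyFromLocal.  The 
trade-off is as stated above: the data is no longer checksummed as it is 
read back in.

import java.io.File;

public class RemoveCrcFiles {

  // Recursively delete Hadoop checksum files (named ".NAME.crc") under dir.
  public static void removeCrcs(File dir) {
    File[] entries = dir.listFiles();
    if (entries == null) {
      return;                                        // not a directory
    }
    for (int i = 0; i < entries.length; i++) {
      if (entries[i].isDirectory()) {
        removeCrcs(entries[i]);                      // descend into subdirs
      } else if (entries[i].getName().startsWith(".")
                 && entries[i].getName().endsWith(".crc")) {
        entries[i].delete();                         // drop the checksum file
      }
    }
  }
}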


Re: Help: -copyFromLocal

Posted by Doug Cutting <cu...@apache.org>.
Eric Baldeschwieler wrote:
> Interesting.  It would actually be nice to include the CRCs in an  
> export, so that you can validate your data when you reload it.  CRCs  
> are best if they are kept end to end.

CRCs are included in the export.  As the files are read from dfs, the 
CRCs are checked.  As they're written to the local fs, new CRCs are 
computed and written.  But then when the local files are listed, 
preparing to write them back to dfs, the CRC files are listed too.  So 
when we copy a file back to dfs, we check its CRC on read and generate 
a new CRC on write.  Then we try to explicitly copy the CRC file and 
get an already-exists error.  Not to mention that we'd be generating a 
.crc file for the .crc file.  So the immediate bug is that we're 
listing .crc files from the local FS.  These should be excluded from 
directory listings there, as they are elsewhere.
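
A minimal sketch of that fix, filtering checksum files out of the local 
listing.  The suffix test here is only a stand-in for 
FileSystem.isChecksumFile(), whose exact signature may differ, and the 
real change belongs wherever copyFromLocal builds its list of source 
files.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LocalListingSketch {

  // Stand-in for FileSystem.isChecksumFile(): Hadoop checksum files are
  // named ".NAME.crc" and sit next to the data file they cover.
  static boolean isChecksumFile(File f) {
    return f.getName().startsWith(".") && f.getName().endsWith(".crc");
  }

  // List the files to copy, excluding checksum files.
  public static File[] listDataFiles(File dir) {
    File[] all = dir.listFiles();
    if (all == null) {
      return new File[0];                      // not a directory
    }
    List<File> keep = new ArrayList<File>();
    for (int i = 0; i < all.length; i++) {
      if (!isChecksumFile(all[i])) {
        keep.add(all[i]);
      }
    }
    return keep.toArray(new File[keep.size()]);
  }
}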

We could try to copy CRC files rather than re-generate them, but that's 
a separate issue.  Things should work correctly if one lists a 
directory, opens each file, and writes its content to a new file in 
another directory.  That's valid user code using standard public APIs, 
and there's no opportunity in that case to copy CRC files directly.  The 
way to fix this is to not list CRC files in copyFromLocal.  The 
FileSystem API has both listFiles and listFilesRaw methods for this very 
purpose.  But the copyFromLocal code doesn't use these correctly.  It 
doesn't use the Hadoop FileSystem API to access local files, but rather 
the normal Java APIs.  That's the bug.
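
As a rough sketch of what that looks like, here is a copyFromLocal-style 
loop that goes through the Hadoop FileSystem abstraction on the local 
side as well.  The listing then omits .crc files automatically, CRCs are 
verified as each file is read, and fresh ones are written on the dfs 
side.  This is written against the later Path-based API 
(getLocal/listStatus/open/create) rather than the File-based 
listFiles()/listFilesRaw() of the current code, but the shape is the same.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyFromLocalSketch {

  public static void copyDir(Configuration conf, Path localDir, Path dfsDir)
      throws IOException {
    FileSystem local = FileSystem.getLocal(conf);     // checksummed local fs
    FileSystem dfs = FileSystem.get(conf);            // the target dfs
    FileStatus[] files = local.listStatus(localDir);  // .crc files not listed
    for (int i = 0; i < files.length; i++) {
      if (files[i].isDirectory()) {
        continue;                                     // flat copy in this sketch
      }
      Path src = files[i].getPath();
      Path dst = new Path(dfsDir, src.getName());
      // open() verifies the local CRC on read; create() writes a new one on dfs.
      IOUtils.copyBytes(local.open(src), dfs.create(dst), conf, true);
    }
  }
}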

We could change file copying to copy CRC files without re-generating 
them, disabling re-generation in this case.  This would make CRCs more 
end-to-end, since it could catch corruption introduced while the data 
sits in the 4k copying buffer.  The corruption we have seen has been 
with very large buffers (e.g., when sorting), so the copy buffer is not 
a likely place for corruption, but still a possible one.  And, again, 
this is separate from the issue reported.
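
If we did go that route, a sketch of it might look like the following, 
assuming both sides are handed to us as "raw" views that neither verify 
nor write checksums (the role listFilesRaw() and the raw streams play 
today), and that checksum files follow the ".NAME.crc" naming 
convention.  None of these names are meant to match the current code 
exactly.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RawCrcCopySketch {

  public static void copyWithCrc(Configuration conf, FileSystem rawSrc,
                                 FileSystem rawDst, Path src, Path dst)
      throws IOException {
    // Copy the data bytes without recomputing a checksum on either side.
    IOUtils.copyBytes(rawSrc.open(src), rawDst.create(dst), conf, true);
    // Then carry the existing checksum file across unchanged, so corruption
    // picked up in the copy buffer is still caught the next time the data
    // is read.
    Path srcCrc = new Path(src.getParent(), "." + src.getName() + ".crc");
    Path dstCrc = new Path(dst.getParent(), "." + dst.getName() + ".crc");
    IOUtils.copyBytes(rawSrc.open(srcCrc), rawDst.create(dstCrc), conf, true);
  }
}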

Doug

Re: Help: -copyFromLocal

Posted by Monu Ogbe <mo...@richmondinformatics.com>.
This is now reported as HADOOP-112 in JIRA. 

----- Original Message ----- 
From: "Eric Baldeschwieler" <er...@yahoo-inc.com>
To: <ha...@lucene.apache.org>
Sent: Thursday, March 30, 2006 5:42 AM
Subject: Re: Help: -copyFromLocal

