Posted to common-dev@hadoop.apache.org by Eric Baldeschwieler <er...@yahoo-inc.com> on 2006/03/30 06:42:24 UTC
Re: Help: -copyFromLocal
Interesting. It would actually be nice to include the CRCs in an
export, so that you can validate your data when you reload it. CRCs
are best if they are kept end to end.
On Mar 29, 2006, at 9:59 AM, Doug Cutting wrote:
> monu.ogbe@richmondinformatics.com wrote:
>> However I'm getting the following error:
>> copyFromLocal: Target /user/root/crawl/crawldb/current/
>> part-00000/.data.crc
>> already exists
>
> Please file a bug report. The problem is that when copyFromLocal
> enumerates local files it should exclude .crc files, but it does
> not. This is the listFiles() call on DistributedFileSystem:160. It
> should filter this, excluding files that are
> FileSystem.isChecksumFile().
>
> BTW, as a workaround, it is safe to first remove all of the .crc
> files, but your files will no longer be checksummed as they are
> read. On systems without ECC memory file corruption is not
> uncommon, but I have seen very little on clusters that have ECC.
>
> Doug
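The filtering described above can be sketched in plain Java. The isChecksumFile name mirrors the method mentioned in the message; everything else here is an illustrative stand-in, not Hadoop's actual code (whose checksum naming convention is assumed to be a hidden ".name.crc" companion file):

```java
import java.io.File;
import java.io.FilenameFilter;

// Sketch of the fix: when enumerating files for a copy, skip checksum
// companions. isChecksumFile follows the name in the message; the filter
// is plain java.io, not Hadoop's actual implementation.
public class CrcFilter {
    // A checksum file, in this assumed convention, is a hidden file
    // ending in ".crc" (e.g. ".data.crc" alongside "data").
    static boolean isChecksumFile(String name) {
        return name.startsWith(".") && name.endsWith(".crc");
    }

    // List a directory, excluding checksum files.
    static File[] listNonCrcFiles(File dir) {
        return dir.listFiles(new FilenameFilter() {
            public boolean accept(File d, String name) {
                return !isChecksumFile(name);
            }
        });
    }
}
```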
Re: Help: -copyFromLocal
Posted by Doug Cutting <cu...@apache.org>.
Eric Baldeschwieler wrote:
> Interesting. It would actually be nice to include the CRCs in an
> export, so that you can validate your data when you reload it. CRCs
> are best if they are kept end to end.
CRCs are included in the export. As the files are read from dfs, the
CRCs are checked. As they're written to the local fs, new CRCs are
computed and written. But then, when the local files are listed in
preparation for writing them back to dfs, the CRC files are listed too.
So when we copy a file back to dfs, we check its CRC on read and
generate a new CRC on write. Then we try to explicitly copy the CRC
file and get an already-exists error. Not to mention that we'd be
generating a .crc file for the .crc file. So the immediate bug is that
we're listing .crc files from the local FS. These should be excluded
from directory listings there, as they are elsewhere.
We could try to copy CRC files rather than re-generate them, but that's
a separate issue. Things should work correctly if one lists a
directory, opens each file, and writes its content to a new file in
another directory. That's valid user code using standard public APIs,
and there's no opportunity in that case to copy CRC files directly. The
way to fix this is for copyFromLocal not to list CRC files. The
FileSystem API has both listFiles and listFilesRaw methods for this very
purpose. But the copyFromLocal code doesn't use them: it accesses local
files through the normal Java APIs rather than the Hadoop FileSystem
API. That's the bug.
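The listFiles/listFilesRaw split named above can be sketched as a toy class; the method names come from the message, but the signatures and internals are illustrative assumptions, not the real Hadoop FileSystem API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy sketch of the two-listing design the message describes:
// listFiles hides checksum companions, listFilesRaw shows everything.
// Real Hadoop signatures differ; this only illustrates the split.
public class ListingSketch {
    private final List<String> names;

    ListingSketch(List<String> names) {
        this.names = names;
    }

    static boolean isChecksumFile(String name) {
        return name.startsWith(".") && name.endsWith(".crc");
    }

    // Filtered listing: what user code, and copyFromLocal, should use.
    List<String> listFiles() {
        List<String> out = new ArrayList<String>();
        for (String n : names) {
            if (!isChecksumFile(n)) out.add(n);
        }
        return out;
    }

    // Raw listing: includes checksum companions, for internal use.
    List<String> listFilesRaw() {
        return new ArrayList<String>(names);
    }
}
```

Driving copyFromLocal from listFiles() instead of a raw java.io listing is exactly the point: the .crc files never appear in the copy set, so they are regenerated on write rather than colliding with existing targets.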
We could change file copying to copy CRC files without re-generating
them, disabling re-generation in this case. This would make CRCs more
end-to-end, since it could catch corruption introduced while data sits
in the 4k copy buffer. The corruption we've seen has occurred when very
large buffers are in use (e.g., when sorting), so this is not a likely
place for corruption, but it is still a possible one. And, again, this
is separate from the issue reported.
Doug
Re: Help: -copyFromLocal
Posted by Monu Ogbe <mo...@richmondinformatics.com>.
This is now reported as HADOOP-112 in JIRA.
----- Original Message -----
From: "Eric Baldeschwieler" <er...@yahoo-inc.com>
To: <ha...@lucene.apache.org>
Sent: Thursday, March 30, 2006 5:42 AM
Subject: Re: Help: -copyFromLocal
> Interesting. It would actually be nice to include the CRCs in an
> export, so that you can validate your data when you reload it. CRCs
> are best if they are kept end to end.
>
> On Mar 29, 2006, at 9:59 AM, Doug Cutting wrote:
>
>> monu.ogbe@richmondinformatics.com wrote:
>>> However I'm getting the following error:
>>> copyFromLocal: Target /user/root/crawl/crawldb/current/
>>> part-00000/.data.crc
>>> already exists
>>
>> Please file a bug report. The problem is that when copyFromLocal
>> enumerates local files it should exclude .crc files, but it does
>> not. This is the listFiles() call on DistributedFileSystem:160. It
>> should filter this, excluding files that are
>> FileSystem.isChecksumFile().
>>
>> BTW, as a workaround, it is safe to first remove all of the .crc
>> files, but your files will no longer be checksummed as they are
>> read. On systems without ECC memory file corruption is not
>> uncommon, but I have seen very little on clusters that have ECC.
>>
>> Doug
>
>