Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2007/10/17 22:38:50 UTC
[jira] Resolved: (HADOOP-2032) distcp split generation does not work correctly
[ https://issues.apache.org/jira/browse/HADOOP-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas resolved HADOOP-2032.
-----------------------------------
Resolution: Duplicate
Fixed by HADOOP-2033
> distcp split generation does not work correctly
> -----------------------------------------------
>
> Key: HADOOP-2032
> URL: https://issues.apache.org/jira/browse/HADOOP-2032
> Project: Hadoop
> Issue Type: Bug
> Components: util
> Reporter: Runping Qi
>
> With the current implementation, distcp always assigns multiple files to a single mapper to copy, no matter how large
> the files are. This is because the CopyFiles class uses a sequence file to store the list of files to be copied,
> one record per file. The CopyFiles class correctly generates one split per record in the sequence file. However,
> due to the way the sequence file record reader works, the minimum unit for splits is the segment between two
> sync marks in the sequence file.
> This results in the strange behavior that some mappers get zero records (zero files to copy) even though their
> split lengths are non-zero, while other mappers get multiple records (multiple files to copy) from their split (and beyond it,
> up to the next sync mark).
> When the CopyFiles class creates the sequence file, it does try to place a sync mark between splittable segments by calling the sync() function of the sequence file record writer.
> Unfortunately, the sync() function is a no-op for files that are not block compressed.
> Naturally, after I changed the compression type for the sequence file to block compression,
> mappers got the correct records from their splits.
> So a simple fix is to change the compression type to CompressionType.BLOCK:
> {code}
> // create src list
> SequenceFile.Writer writer = SequenceFile.createWriter(
>     jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
>     LongWritable.class, FilePair.class,
>     SequenceFile.CompressionType.BLOCK);
> {code}
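The behavior described in the report can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Hadoop's actual reader: the class SyncSplitModel and the method recordsPerSplit are hypothetical names, records are assumed to have a fixed size, sync marks are assumed to sit exactly on record boundaries, and the start of the file is treated as a sync point. The model encodes the rule from the description: the reader of a split begins at the first sync point at or after the split start, and keeps reading until the first sync point at or after the split end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of split handling; it does not use or reproduce the Hadoop API.
class SyncSplitModel {

    /**
     * Returns how many records the reader of each split consumes when there is
     * one split per record. syncPoints holds byte offsets of sync marks
     * (assumed to be record-aligned, with offset 0 standing for the file start).
     */
    public static List<Integer> recordsPerSplit(int numRecords, long recordSize,
                                                TreeSet<Long> syncPoints) {
        long fileLength = numRecords * recordSize;
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < numRecords; i++) {
            long start = i * recordSize;
            long end = start + recordSize;
            // The reader skips forward to the first sync point at or after the split start.
            Long first = syncPoints.ceiling(start);
            if (first == null || first >= end) {
                // No sync point inside this split: the mapper gets zero records.
                counts.add(0);
                continue;
            }
            // The reader stops at the first sync point at or after the split end,
            // or at end of file if there is none.
            Long stop = syncPoints.ceiling(end);
            long limit = (stop == null) ? fileLength : stop;
            counts.add((int) ((limit - first) / recordSize));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sparse sync marks: only at the file start and before record 4.
        TreeSet<Long> sparse = new TreeSet<>(Arrays.asList(0L, 400L));
        System.out.println(recordsPerSplit(8, 100L, sparse));

        // A sync mark before every record, which is what the fix aims for.
        TreeSet<Long> dense = new TreeSet<>();
        for (long offset = 0; offset < 800; offset += 100) {
            dense.add(offset);
        }
        System.out.println(recordsPerSplit(8, 100L, dense));
    }
}
```

With sparse sync marks, two of the eight modeled mappers receive four records each and the other six receive none, even though every split has a non-zero length. With a sync mark before every record, each mapper receives exactly one record, which matches the result reported after switching to block compression.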