Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2007/10/17 22:38:50 UTC

[jira] Resolved: (HADOOP-2032) distcp split generation does not work correctly

     [ https://issues.apache.org/jira/browse/HADOOP-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas resolved HADOOP-2032.
-----------------------------------

    Resolution: Duplicate

Fixed by HADOOP-2033

> distcp split generation does not work correctly
> -----------------------------------------------
>
>                 Key: HADOOP-2032
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2032
>             Project: Hadoop
>          Issue Type: Bug
>          Components: util
>            Reporter: Runping Qi
>
> With the current implementation, distcp always assigns multiple files to a single mapper, no matter how large
> the files are. This is because the CopyFiles class uses a sequence file to store the list of files to be copied,
> one record per file. The CopyFiles class correctly generates one split per record in the sequence file. However,
> due to the way the sequence file record reader works, the minimum unit for splits is the segment between two
> sync marks in the sequence file.
> As a result, some mappers get zero records (zero files to copy) even though their
> split lengths are non-zero, while other mappers get multiple records (multiple files to copy) from their split (and beyond,
> up to the next sync mark).
> When the CopyFiles class creates the sequence file, it does try to place a sync mark between splittable segments by calling the sync() function of the sequence file record writer.
> Unfortunately, the sync() function is a no-op for files that are not block compressed.
> After I changed the compression type for the sequence file to block compression,
> mappers got the correct records from their splits.
> So a simple fix is to change the compression type to CompressionType.BLOCK:
> {code}
> // create src list
>     SequenceFile.Writer writer = SequenceFile.createWriter(
>         jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
>         LongWritable.class, FilePair.class,
>         SequenceFile.CompressionType.BLOCK);
> {code}
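
The behavior described in the report can be illustrated with a small stand-alone model. This is only a sketch and does not use Hadoop's classes: the class name SyncSplitModel, its methods, and the 100-byte record layout are hypothetical. It assumes that a reader for a split starts at the first sync mark at or after the split start and stops at the first sync mark at or after the split end. The real record reader may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of sync-mark-aligned split reading.
// It is not Hadoop code; it only mimics the behavior described in the report.
public class SyncSplitModel {

    // Returns the first sync offset at or after pos, or fileLength if none exists.
    // syncOffsets must be sorted in ascending order.
    static long nextSync(long[] syncOffsets, long pos, long fileLength) {
        for (long s : syncOffsets) {
            if (s >= pos) {
                return s;
            }
        }
        return fileLength;
    }

    // Returns the indices of the records a reader for split [start, end) would see,
    // assuming it begins at the first sync mark >= start and stops at the first
    // sync mark >= end (or at end of file).
    static List<Integer> recordsForSplit(long[] recordOffsets, long[] syncOffsets,
                                         long start, long end, long fileLength) {
        long begin = nextSync(syncOffsets, start, fileLength);
        long stop = nextSync(syncOffsets, end, fileLength);
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < recordOffsets.length; i++) {
            if (recordOffsets[i] >= begin && recordOffsets[i] < stop) {
                records.add(i);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Six records of 100 bytes each (assumed layout), one split per record.
        long[] recordOffsets = {0, 100, 200, 300, 400, 500};
        long fileLength = 600;
        long[] sparseSync = {0, 300};                     // few sync marks
        long[] denseSync = {0, 100, 200, 300, 400, 500};  // one sync mark per record

        for (int i = 0; i < recordOffsets.length; i++) {
            long start = recordOffsets[i];
            long end = start + 100;
            System.out.println("split " + i
                + " sparse=" + recordsForSplit(recordOffsets, sparseSync, start, end, fileLength)
                + " dense=" + recordsForSplit(recordOffsets, denseSync, start, end, fileLength));
        }
    }
}
```

In this model, with sync marks only at offsets 0 and 300, split 0 receives records 0 to 2, split 3 receives records 3 to 5, and the other four splits receive nothing. With a sync mark in front of every record, each split receives exactly its own record, which corresponds to what the report observed after switching to block compression.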

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.