
Posted to common-dev@hadoop.apache.org by "lohit vijayarenu (JIRA)" <ji...@apache.org> on 2007/10/31 21:04:50 UTC

[jira] Commented: (HADOOP-2120) dfs -getMerge does not do what it says it does

    [ https://issues.apache.org/jira/browse/HADOOP-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539170 ] 

lohit vijayarenu commented on HADOOP-2120:
------------------------------------------

If we think of this as a map-reduce job that actually merges and sorts into a single file, shouldn't it be available as a separate package (like distcp, maybe)?
Merging files would be very useful for users who want only one output file. For now they stick to a single reducer rather than submit a job with multiple reducers (even though that would give better machine utilization). Wouldn't a generic merge utility which understands the file format and merges accordingly be useful? Something motivated by https://issues.apache.org/jira/browse/HADOOP-2113

> dfs -getMerge does not do what it says it does
> ----------------------------------------------
>
>                 Key: HADOOP-2120
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2120
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.14.3
>         Environment: All
>            Reporter: Milind Bhandarkar
>             Fix For: 0.16.0
>
>
> dfs -getMerge, which calls FileUtil.CopyMerge, contains this javadoc:
> {code}
> Get all the files in the directories that match the source file pattern
>    * and merge and sort them to only one file on local fs 
>    * srcf is kept.
> {code}
> However, it only concatenates the set of input files, rather than merging them in sorted order.
> Ideally, copyMerge should be equivalent to a map-reduce job with IdentityMapper and IdentityReducer with numReducers = 1. However, not having to run this as a map-reduce job has some advantages, since it increases cluster utilization during the reduce phase.
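
To make the difference between concatenating and merging in sorted order concrete, here is a minimal Java sketch. It is not the FileUtil code: the class MergeSketch and its methods concatenate and mergeSorted are hypothetical names, and it assumes each input is already sorted and consists of in-memory line records compared as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Hadoop.
class MergeSketch {

    /** Concatenation: append each input in order, with no regard to record order. */
    public static List<String> concatenate(List<List<String>> inputs) {
        List<String> out = new ArrayList<>();
        for (List<String> input : inputs) {
            out.addAll(input);
        }
        return out;
    }

    /** The current head record of one input plus the remainder of that input. */
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(String head, Iterator<String> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    /** K-way merge: assumes every input is already sorted; output is globally sorted. */
    public static List<String> mergeSorted(List<List<String>> inputs) {
        // The heap always exposes the input whose head record is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> input : inputs) {
            Iterator<String> it = input.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                // Advance this input and put it back in the heap.
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList(
            Arrays.asList("apple", "melon"),
            Arrays.asList("banana", "zebra"));
        System.out.println(concatenate(parts)); // [apple, melon, banana, zebra]
        System.out.println(mergeSorted(parts)); // [apple, banana, melon, zebra]
    }
}
```

The merge keeps only one record per input in memory at a time, so the same idea could be applied to streams read from several part files, provided each part is itself sorted.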

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.