Posted to mapreduce-issues@hadoop.apache.org by "Ben Podgursky (JIRA)" <ji...@apache.org> on 2013/10/18 17:20:47 UTC

[jira] [Commented] (MAPREDUCE-199) Locality hints for Reduce

    [ https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799186#comment-13799186 ] 

Ben Podgursky commented on MAPREDUCE-199:
-----------------------------------------

There doesn't seem to have been any progress on this recently, but this functionality would be really helpful to us (we've been hunting for a way to do exactly this).

Our use case is somewhat similar to the HBase one: we have a number of stores which we keep sorted on the same keys and partitioned identically (e.g., partitioned into part files 0000-0599).  When we need to join these stores, instead of running a full map + reduce, we can just run one map task per partition, which reads in the matching part file from each side of the join.  Since we read these stores many times, it saves us a lot of cluster time to sort the files only once.
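
For illustration, the per-partition join described above amounts to a merge join over two inputs that are already sorted on the same key. The following is a minimal sketch in plain Python, not Hadoop code: the name merge_join and the (key, value) record layout are assumptions made for this example, and in-memory lists stand in for the part files.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # hypothetical (key, value) layout


def merge_join(left: Iterable[Record],
               right: Iterable[Record]) -> Iterator[Tuple[str, str, str]]:
    """Inner-join two record streams that are both sorted by key.

    Each input is read once, front to back, so no sort or shuffle is needed.
    """
    done = (None, None)
    lgroups = groupby(left, key=itemgetter(0))
    rgroups = groupby(right, key=itemgetter(0))
    lkey, lvals = next(lgroups, done)
    rkey, rvals = next(rgroups, done)
    while lvals is not None and rvals is not None:
        if lkey < rkey:
            # Left key has no match yet; advance the left side.
            lkey, lvals = next(lgroups, done)
        elif lkey > rkey:
            # Right key has no match yet; advance the right side.
            rkey, rvals = next(rgroups, done)
        else:
            # Same key on both sides: emit every pairing of values.
            right_values = [value for _, value in rvals]
            for _, left_value in lvals:
                for right_value in right_values:
                    yield (lkey, left_value, right_value)
            lkey, lvals = next(lgroups, done)
            rkey, rvals = next(rgroups, done)


# Stand-ins for the same-numbered part file of store A and store B.
store_a_part = [("a", "1"), ("b", "2"), ("b", "3")]
store_b_part = [("b", "x"), ("c", "y")]
joined = list(merge_join(store_a_part, store_b_part))
```

Because both inputs share the same sort order and partitioning, each map task only needs the two same-numbered part files, which is why having them on the same host matters.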

These files are each produced by a normal reduce task.  It would be great if we were able to give Hadoop a hint that part-0123 of store A and part-0123 of store B should end up on the same host, so any job joining the two files will be reading purely local data.  Ideally we could accomplish this by giving Hadoop a hint about where to run each reduce task so we don't have to shuffle the data around later.
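
One way such a hint could be computed is a deterministic mapping from partition number to host, so that every job writing partition 123 of any store asks for the same machine. The sketch below is hypothetical: Hadoop exposes no such hook today (that is what this issue requests), and the names partition_of and preferred_host, the host names, and the modulo scheme are all assumptions made for illustration.

```python
from typing import List


def partition_of(part_file_name: str) -> int:
    """Extract the partition number from a name like 'part-0123'."""
    return int(part_file_name.rsplit("-", 1)[1])


def preferred_host(partition: int, hosts: List[str]) -> str:
    """Pick a host for a partition, independent of the order hosts are listed in.

    Sorting first makes the choice stable across jobs, as long as the set of
    hosts itself does not change between runs.
    """
    ordered = sorted(hosts)
    return ordered[partition % len(ordered)]


# Hypothetical cluster; two jobs may see the hosts in a different order.
hosts_seen_by_job_a = ["node-c", "node-a", "node-b"]
hosts_seen_by_job_b = ["node-b", "node-c", "node-a"]

host_a = preferred_host(partition_of("part-0123"), hosts_seen_by_job_a)
host_b = preferred_host(partition_of("part-0123"), hosts_seen_by_job_b)
```

A plain modulo mapping reshuffles most partitions whenever a host is added or removed; something like consistent hashing would limit that, at the cost of more machinery.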

> Locality hints for Reduce
> -------------------------
>
>                 Key: MAPREDUCE-199
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: applicationmaster, mrv2
>            Reporter: Benjamin Reed
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add a method to OutputFormat that would allow a job to indicate where a reducer for a given partition should run. This is similar to the getSplits() method on InputFormat. In our application, the reducer uses other data in addition to the map outputs during processing, and data accesses could be made more efficient if the JobTracker scheduled the reducers to run on specific hosts.



--
This message was sent by Atlassian JIRA
(v6.1#6144)