Posted to mapreduce-issues@hadoop.apache.org by "Yan Zhou (JIRA)" <ji...@apache.org> on 2013/07/19 21:08:51 UTC
[jira] [Commented] (MAPREDUCE-2038) Making reduce tasks locality-aware

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713968#comment-13713968 ] 

Yan Zhou commented on MAPREDUCE-2038:
-------------------------------------

The HBase bulk load case generalizes to many, if not all, distributed standing services such as Tez, which are becoming popular with the current trend toward low-latency and interactive workloads. The upcoming HDFS caching is another factor. Locality-aware reducer scheduling is expected to have a big impact on the performance of those "single-shot" map/reduce jobs. DAG executions in general will benefit as well, but the executor scheduling and the output replication mechanism, if any, are expected to be more complex.

The 2011 paper seems to maximize the likelihood of locality in reducer scheduling based on the cluster's physical topology, which may be well suited to the "rack combiner" case above. For a DAG execution, however, some "external" information about that execution has to be provided to the scheduler.

In the case of an HBase bulk loader, or of a downstream standing executor, the scheduler could heed a hint from the Partitioner when placing reducers.
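
To make the Partitioner hint idea concrete, here is a minimal, self-contained sketch. It does not use or extend any real Hadoop class: the LocalityHint interface, RegionAwarePartitioner, ReducerPlacement, and the host names are all hypothetical, and string keys stand in for HBase row keys. The assumption is that partition i feeds the region served by the i-th host, so that host is the natural place to run reducer i.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;

// Hypothetical interface, NOT part of Hadoop's Partitioner API.
interface LocalityHint {
    // Hosts where the reducer for this partition would ideally run.
    List<String> preferredHosts(int partition);
}

// Hypothetical range partitioner for a bulk load (illustration only).
final class RegionAwarePartitioner implements LocalityHint {
    private final String[] splitKeys;   // sorted start keys of regions 1..n-1
    private final String[] regionHosts; // regionHosts[i] serves region i

    RegionAwarePartitioner(String[] splitKeys, String[] regionHosts) {
        if (regionHosts.length != splitKeys.length + 1) {
            throw new IllegalArgumentException("need one host per region");
        }
        this.splitKeys = splitKeys.clone();
        this.regionHosts = regionHosts.clone();
    }

    // Partition index = number of split keys less than or equal to the key.
    int getPartition(String key) {
        int idx = Arrays.binarySearch(splitKeys, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    @Override
    public List<String> preferredHosts(int partition) {
        return Collections.singletonList(regionHosts[partition]);
    }
}

// Hypothetical scheduler-side helper that heeds the hint when it can.
final class ReducerPlacement {
    // Returns a preferred host if one is free, otherwise the first free host,
    // or null when no host is free.
    static String pickHost(LocalityHint hint, int partition, SortedSet<String> freeHosts) {
        for (String host : hint.preferredHosts(partition)) {
            if (freeHosts.contains(host)) {
                return host;
            }
        }
        return freeHosts.isEmpty() ? null : freeHosts.first();
    }
}
```

The hint is advisory in this sketch: when the preferred host has no free slot, the reducer still runs elsewhere, which is how map-side locality is treated today.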
                
> Making reduce tasks locality-aware
> ----------------------------------
>
>                 Key: MAPREDUCE-2038
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2038
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>
> Currently the Hadoop MapReduce framework does not take data locality into consideration when it decides to launch reduce tasks. There are several cases where this could be sub-optimal.
> - The map output data for a particular reduce task are not distributed evenly across different racks. This could happen when the job does not have many maps, or when there is heavy skew in map output data.
> - A reduce task may need to access some side file (e.g. Pig fragmented join, or incremental merge of unsorted smaller dataset with an already sorted large dataset). It'd be useful to place reduce tasks based on the location of the side files they need to access.
> This jira is created for the purpose of soliciting ideas on how we can make it better.
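
For the first case in the description (map output for one reduce task skewed across racks), one possible placement rule is to prefer the rack that already holds the most bytes destined for that reducer. The sketch below is only an illustration of that rule, not an existing Hadoop scheduler feature: the ReducerRackChooser class, its inputs, and the "/default-rack" placeholder label for unmapped hosts are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop.
final class ReducerRackChooser {
    // bytesByHost: map output bytes destined for ONE reducer, keyed by the
    //              host that produced them.
    // hostToRack:  cluster topology; hosts missing from it are grouped under
    //              a placeholder rack label.
    // Returns the rack holding the most bytes (ties broken by rack name),
    // or null when there is no map output at all.
    static String preferredRack(Map<String, Long> bytesByHost,
                                Map<String, String> hostToRack) {
        Map<String, Long> bytesByRack = new HashMap<>();
        for (Map.Entry<String, Long> e : bytesByHost.entrySet()) {
            String rack = hostToRack.getOrDefault(e.getKey(), "/default-rack");
            bytesByRack.merge(rack, e.getValue(), Long::sum);
        }
        String best = null;
        long bestBytes = -1L;
        for (Map.Entry<String, Long> e : bytesByRack.entrySet()) {
            long bytes = e.getValue();
            boolean better = bytes > bestBytes
                || (bytes == bestBytes && e.getKey().compareTo(best) < 0);
            if (better) {
                best = e.getKey();
                bestBytes = bytes;
            }
        }
        return best;
    }
}
```

With few maps or heavy skew, a single rack can hold most of a reducer's input, so starting the reducer there keeps most of the shuffle traffic off the inter-rack links.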
