You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-user@hadoop.apache.org by 梁李印 <li...@aliyun-inc.com> on 2012/08/04 15:59:06 UTC

答复: 答复: MapReduce shuffle question

The optimization you mentioned is reduce-task locality-aware. Unfortunately,
the current scheduler doesn't consider the reduce task's data locality. So a
reduce task can be scheduled to any node with free slots.
The following jira is discussing this problem:
https://issues.apache.org/jira/browse/MAPREDUCE-2038

Liyin Liang
-----邮件原件-----
发件人: Satheesh Kumar [mailto:nkseam@gmail.com] 
发送时间: 2012年8月4日 1:47
收件人: common-user@hadoop.apache.org
主题: Re: 答复: MapReduce shuffle question

Thank you. One more follow up question:

Are there any optimizations to run map and reduces on the same nodes so
that data is not transported across the network? Generally, how often and
what % of map output is actually transferred over the network to reduce
nodes?

Thanks,
Satheesh

On Fri, Aug 3, 2012 at 7:33 AM, 梁李印 <li...@aliyun-inc.com> wrote:

> When a map task is done, its output is always flushed to the disk and
> merged
> to one file.
> The benefit is that if the reducer is failed, the map need not to re-run.
>
> Liyin Liang
>
> -----邮件原件-----
> 发件人: Satheesh Kumar [mailto:nkseam@gmail.com]
> 发送时间: 2012年8月3日 21:23
> 收件人: common-user@hadoop.apache.org
> 主题: MapReduce shuffle question
>
> Team, can someone please clarify the following question?
>
> In the map phase, the map output is written to the local disk. And in the
> shuffle phase, the map output partitions are transferred to reduce nodes
> using http. So, my question is assuming there are no spills (data set is
> small enough to accommodate this), will the map output be transferred
> directly from memory to the reduce nodes using http without a disk access
> to write the map output? Or, is the map output always flushed to the disk
> before transferred to reduce nodes?
>
> Appreciate the help.
>
> Thanks,
> Satheesh
>
>