Posted to user@spark.apache.org by Umar Javed <um...@gmail.com> on 2013/11/11 06:38:52 UTC

mapping of shuffle outputs to reduce tasks

I was wondering how the scheduler assigns the ShuffledRDD locations to
the reduce tasks. Say that you have 4 reduce tasks and a number of shuffle
blocks spread across two machines. Is each reduce task responsible for a
subset of individual keys, or for a subset of shuffle blocks?

Umar

Re: mapping of shuffle outputs to reduce tasks

Posted by Umar Javed <um...@gmail.com>.
Thanks Josh. I think I need a little more clarification. Say K=2 and R=2,
and you have M*R shuffle blocks. Both executors start at the same time.
What is the initial partitioning? How many shuffle blocks does each executor
get at the beginning? Also, is this assignment random, or does each reduce
task operate on the local shuffle blocks first and only then move on to the
shuffle blocks on a remote machine?

thanks!
Umar


Re: mapping of shuffle outputs to reduce tasks

Posted by Josh Rosen <ro...@gmail.com>.
Let's say that you're running a MapReduce job with *M* map tasks, *R* reduce
tasks, and *K* machines.  Each map task will produce *R* shuffle outputs
(so *M* x *R* shuffle blocks total).  When the reduce phase starts, pending
reduce tasks are pulled off a queue and scheduled on executors.  Reduce
tasks aren't assigned to particular machines in advance; they're scheduled
as executors become free.

If you have more reduce tasks than machines (*R* > *K*), then some machines
will run multiple reduce tasks.  You might want to run more reduce tasks
than machines to (a) limit an individual reduce task's memory requirements,
or (b) adapt to skew and stragglers.  With smaller, more granular reduce
tasks, slower machines can simply run fewer tasks while the remaining work
can be divided among the other machines.  The trade-off here is increased
scheduling overhead and more reduce output partitions, although the
scheduling overhead may be negligible in many cases and the small
post-shuffle outputs could be combined using coalesce().
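
A rough sketch in Scala of how this plays out with the standard RDD API
(the input/output paths and partition counts below are made up for
illustration, not taken from this thread): the numPartitions argument to
reduceByKey sets R, and coalesce() merges the small post-shuffle
partitions afterwards without another shuffle.

  import org.apache.spark.{SparkConf, SparkContext}

  object ShufflePartitionsSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-sketch"))

      // M map tasks: one per partition of the input file (hypothetical path).
      val pairs = sc.textFile("hdfs:///tmp/input.txt")
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))

      // R reduce tasks: the second argument to reduceByKey sets the number of
      // post-shuffle partitions, so each of the M map tasks writes R shuffle
      // blocks, one per reducer.
      val counts = pairs.reduceByKey(_ + _, 200)

      // If R was made large to cope with skew and stragglers, the many small
      // post-shuffle partitions can be merged without triggering another shuffle.
      val compacted = counts.coalesce(16)

      compacted.saveAsTextFile("hdfs:///tmp/output")   // hypothetical path
      sc.stop()
    }
  }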





Re: mapping of shuffle outputs to reduce tasks

Posted by Umar Javed <um...@gmail.com>.
Say that you have a taskSet of maps, each operating on one Hadoop
partition. How does the scheduler decide which map task's output (i.e., a
shuffle block) goes to which reducer? Are the shuffle blocks evenly split
among reducers?



Re: mapping of shuffle outputs to reduce tasks

Posted by Aaron Davidson <il...@gmail.com>.
Each reduce task is responsible for a subset of shuffle blocks. Map tasks
split up their data, creating one shuffle block for every reducer. During the
shuffle phase, the reducer will fetch all shuffle blocks that were intended for it.
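
To make the "one shuffle block per reducer" point concrete, here is a
minimal sketch in plain Scala (not Spark's actual shuffle-writer code),
assuming the default hash-partitioning scheme; the keys, values, and
reducer count are made up for illustration.

  object ShuffleBlockSketch {

    // Non-negative modulo of the key's hashCode, the same convention
    // Spark's HashPartitioner uses to pick a reducer for a key.
    def reducerFor(key: Any, numReducers: Int): Int = {
      val mod = key.hashCode % numReducers
      if (mod < 0) mod + numReducers else mod
    }

    def main(args: Array[String]): Unit = {
      val numReducers = 4
      val mapOutput = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "d" -> 5)

      // One bucket per reducer: these buckets are the "shuffle blocks";
      // reducer i later fetches bucket i from every map task's output.
      val blocks = mapOutput.groupBy { case (k, _) => reducerFor(k, numReducers) }

      blocks.toSeq.sortBy(_._1).foreach { case (reducer, records) =>
        println(s"shuffle block for reducer $reducer: $records")
      }
    }
  }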

