You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-user@hadoop.apache.org by Pedro Costa <ps...@gmail.com> on 2010/06/30 22:29:32 UTC

Partition - Distribution of map outputs

Hi,

I'm studyong the partition implementation of MR and I have some questions.

As I understand from what I've read, the partition has the purpose to tell
to each reducer which map output it will have.
For example, if I've 3 split files and 2 reduces defined in my example, on
the map side it wil be produce 3 map outputs (one map per split file) and on
the reduce side, it will be produced 2 part-* files. The part_00000 it will
contains the results of 2 map outputs and the part_00001 will contain the
results of 1 map output.

- My question is, where in the Hadoop MR is set "which Reduce contains which
Map Output"? Is it during the creation of the reduce tasks, or is in another
phase of the MR?

- Can you point me which class does this distribution of map outputs to the
reduce tasks?

Thanks,
-- 
Pedro

Re: Partition - Distribution of map outputs

Posted by Hemanth Yamijala <yh...@gmail.com>.
Pedro,

On Thu, Jul 1, 2010 at 3:37 AM, Pedro Costa <ps...@gmail.com> wrote:
> - I'm running the wordcount example that accepts 3 small txt files as input.
> I assume that there will exist 3 mappers that produce 3 map outputs. One map
> output per txt file, right?

The number of mappers created will depend on the number of
'InputSplits' generated for the program inputs. How to create
InputSplits is customizable. By default for file based inputs, the
splits are computed according to HDFS blocks they are stored as. So,
in the word count example case, there will be one mapper per split,
not per file. Each mapper generates output for all reduces. The number
of reduces is configured by the user. So, if you've configured 2
reduces, each mapper would typically produce 2 map outputs.

> When reduce tasks fetches the map outputs, they have to know which map
> output he can get. How each reduce knows which map output can get? Who give
> this information to him?

Each reduce task is assigned a 'partition' number by the framework
when it is created. A partition number maps to the particular map
output that a reduce task must fetch output from.
>
> - The distribution of map outputs to the reducers calls partitioning?
>

The allocation of map outputs to reduces happens via partitioning. The
actual data transfer is referred as the Shuffle phase.

> - The reducers, during the sort phase, knows already which map output he
> should get, right?

During the Shuffle phase.

> On Wed, Jun 30, 2010 at 9:36 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:
>>
>> On Jun 30, 2010, at 1:29 PM, Pedro Costa wrote:
>>>
>>> As I understand from what I've read, the partition has the purpose to
>>> tell to each reducer which map output it will have.
>>> For example, if I've 3 split files and 2 reduces defined in my example,
>>> on the map side it wil be produce 3 map outputs (one map per split file) and
>>> on the reduce side, it will be produced 2 part-* files. The part_00000 it
>>> will contains the results of 2 map outputs and the part_00001 will contain
>>> the results of 1 map output.
>>>
>>
>> Typically, each map produces output for each reduce.
>>
>> In your e.g. part-00000 will contain output of reduce-0 and part-00001
>> will contain output of reduce-1.
>>
>>> - My question is, where in the Hadoop MR is set "which Reduce contains
>>> which Map Output"? Is it during the creation of the reduce tasks, or is in
>>> another phase of the MR?
>>>
>>> - Can you point me which class does this distribution of map outputs to
>>> the reduce tasks?
>>>
>>
>> Take a look at the Partitioner - the partitioner for the job decides which
>> keys are sent to which reduce.
>>
>> Arun
>
>
>
> --
> Pedro
>

Re: Partition - Distribution of map outputs

Posted by Pedro Costa <ps...@gmail.com>.
- I'm running the wordcount example that accepts 3 small txt files as input.
I assume that there will exist 3 mappers that produce 3 map outputs. One map
output per txt file, right?

When reduce tasks fetches the map outputs, they have to know which map
output he can get. How each reduce knows which map output can get? Who give
this information to him?

- The distribution of map outputs to the reducers calls partitioning?

- The reducers, during the sort phase, knows already which map output he
should get, right?



On Wed, Jun 30, 2010 at 9:36 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:

>
> On Jun 30, 2010, at 1:29 PM, Pedro Costa wrote:
>
>> As I understand from what I've read, the partition has the purpose to tell
>> to each reducer which map output it will have.
>> For example, if I've 3 split files and 2 reduces defined in my example, on
>> the map side it wil be produce 3 map outputs (one map per split file) and on
>> the reduce side, it will be produced 2 part-* files. The part_00000 it will
>> contains the results of 2 map outputs and the part_00001 will contain the
>> results of 1 map output.
>>
>>
> Typically, each map produces output for each reduce.
>
> In your e.g. part-00000 will contain output of reduce-0 and part-00001 will
> contain output of reduce-1.
>
>
>  - My question is, where in the Hadoop MR is set "which Reduce contains
>> which Map Output"? Is it during the creation of the reduce tasks, or is in
>> another phase of the MR?
>>
>> - Can you point me which class does this distribution of map outputs to
>> the reduce tasks?
>>
>>
> Take a look at the Partitioner - the partitioner for the job decides which
> keys are sent to which reduce.
>
> Arun
>



-- 
Pedro

Re: Partition - Distribution of map outputs

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
On Jun 30, 2010, at 1:29 PM, Pedro Costa wrote:
> As I understand from what I've read, the partition has the purpose  
> to tell to each reducer which map output it will have.
> For example, if I've 3 split files and 2 reduces defined in my  
> example, on the map side it wil be produce 3 map outputs (one map  
> per split file) and on the reduce side, it will be produced 2 part-*  
> files. The part_00000 it will contains the results of 2 map outputs  
> and the part_00001 will contain the results of 1 map output.
>

Typically, each map produces output for each reduce.

In your e.g. part-00000 will contain output of reduce-0 and part-00001  
will contain output of reduce-1.

> - My question is, where in the Hadoop MR is set "which Reduce  
> contains which Map Output"? Is it during the creation of the reduce  
> tasks, or is in another phase of the MR?
>
> - Can you point me which class does this distribution of map outputs  
> to the reduce tasks?
>

Take a look at the Partitioner - the partitioner for the job decides  
which keys are sent to which reduce.

Arun