Posted to mapreduce-user@hadoop.apache.org by "Chinni, Ravi" <rc...@syncsort.com> on 2010/07/26 22:32:56 UTC

Why does the MR framework sort the mapper output?

I have an MR application that runs fine except for its performance.
Increasing the number of data nodes is not an option for me.



Looking at the source code of the MR framework, I noticed that the
partitioned output of each mapper is sorted (MapTask.java), and on the
reduce side the partitions from the various mappers are merged
(ReduceTask.java) before the reduce step runs. Functionally, the
reducers in my application do not require data to be in sorted order, so
getting rid of the sort and merge steps in the framework would help my
application.
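
As a rough illustration of the two steps described above, here is a
minimal plain-Java sketch. It is not Hadoop's MapTask/ReduceTask code;
the class and method names are hypothetical, and keys are plain strings
with no values attached. Each "mapper" sorts its own output for one
partition, and the "reducer" does a k-way merge of those sorted runs,
which leaves equal keys next to each other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only; names are hypothetical, not Hadoop classes.
final class SortMergeSketch {

    // Map side: sort one mapper's output for a single partition by key.
    static List<String> sortMapOutput(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Reduce side: k-way merge of already-sorted runs, one run per mapper.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the key they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the run that supplied the smallest key.
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> mapper1 = sortMapOutput(Arrays.asList("pear", "apple", "fig"));
        List<String> mapper2 = sortMapOutput(Arrays.asList("fig", "apple", "kiwi"));
        System.out.println(mergeSortedRuns(Arrays.asList(mapper1, mapper2)));
    }
}
```

The merge never re-sorts the data; it relies on each run already being
sorted, which is why the map-side sort and the reduce-side merge come as
a pair.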



Does anyone know why the framework sorts and merges the intermediate
data? Is there anything - MR functional concepts, framework design,
etc. - that requires the sort and merge of intermediate data? I want to
try getting rid of the sort and merge steps in the framework, and I
would like to know of any potential risks involved.



Any input is appreciated.



Thanks,

Ravi





_____________________________________________________________________________

ATTENTION:

The information contained in this message (including any files transmitted 
with this message) may contain proprietary, trade secret or other 
confidential and/or legally privileged information. Any pricing 
information contained in this message or in any files transmitted with 
this message is always confidential and cannot be shared with any third 
parties without prior written approval from Syncsort. This message is 
intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any use, disclosure, copying or 
distribution of this message, in any form, is strictly prohibited. If you 
have received this message in error, please immediately notify the sender 
and/or Syncsort and destroy all copies of this message in your possession, 
custody or control.

RE: Why does the MR framework sort the mapper output?

Posted by "Chinni, Ravi" <rc...@syncsort.com>.
Thanks Alex and Ken.

 

My application does not do aggregation. It mainly does some data
cleansing and transformation, so I don't need a combiner. (Also, I don't
see why a combiner always needs sorted input; it should be optional and
user-specified.)

 

To take advantage of some optimizations, I need a partitioner. This
means most of my application logic is in the reducers, so I cannot set
the # of reducers to 0. Of course, I will be happy if there is a way to
set the # of mappers to 0.
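
For context, partitioning on its own does not depend on any ordering: it
only has to send every record with the same key to the same reducer. The
sketch below shows hash partitioning in plain Java. The class and method
names are hypothetical, and the formula is the usual "hash code modulo
number of partitions" idea rather than a copy of any framework class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Hadoop classes.
final class HashPartitionSketch {

    // Mask off the sign bit so the result is always in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Route each key to its bucket; arrival order inside a bucket is preserved.
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numPartitions)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(partition(Arrays.asList("b", "a", "c", "a"), 2));
    }
}
```

Both occurrences of "a" land in the same bucket even though nothing was
sorted, which is the property a reducer-side transformation would rely
on.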

 

Ravi

 

 

From: Ken Goodhope [mailto:kengoodhope@gmail.com] 
Sent: Monday, July 26, 2010 8:00 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Why does the MR framework sort the mapper output?

 

The combiner needs sorted input.


Re: Why does the MR framework sort the mapper output?

Posted by Ken Goodhope <ke...@gmail.com>.
The combiner needs sorted input.
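
One way to read this point, sketched in plain Java under stated
assumptions: a combiner that handles each key's records as one group can
work in a single streaming pass with constant state only if equal keys
are adjacent, which sorting guarantees. The class below is hypothetical
and simply counts keys; it is not how Hadoop invokes a combiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Hadoop classes.
final class SortedCombineSketch {

    // Collapse runs of equal adjacent keys into "key=count" entries.
    // Correct only when the input is sorted (or at least grouped) by key.
    static List<String> combineSorted(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String key : sortedKeys) {
            if (key.equals(current)) {
                count++;
            } else {
                // A new key starts: flush the previous run, if any.
                if (current != null) {
                    out.add(current + "=" + count);
                }
                current = key;
                count = 1;
            }
        }
        if (current != null) {
            out.add(current + "=" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combineSorted(Arrays.asList("a", "a", "b", "c", "c", "c")));
        // On unsorted input the same key is emitted more than once.
        System.out.println(combineSorted(Arrays.asList("a", "b", "a")));
    }
}
```

The second call shows the failure mode: without adjacency, "a" is
flushed twice, so the combined output is no smaller than the input.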

On Mon, Jul 26, 2010 at 1:46 PM, Alex Kozlov <al...@cloudera.com> wrote:

> Hi Ravi,
>
> Whether a sort is required is still a point of debate: the primary reason
> is to collect the entries with the same key, but one can implement MapReduce
> with hash deduping.  The performance advantages/disadvantages are still a
> subject of debate.
>
> If you don't need sorting, you can always implement map-side aggregation
> and potentially set the # of reducers to 0.  There is no risk in doing
> that, but if you want to aggregate results across different mappers, you
> are back to the original problem.
>
> Alex K

Re: Why does the MR framework sort the mapper output?

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Ravi,

Whether a sort is required is still a point of debate: the primary reason is
to collect the entries with the same key, but one can implement MapReduce
with hash deduping.  The performance advantages/disadvantages are still a
subject of debate.
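
Assuming "hash deduping" here means grouping records by key with a hash
table instead of a sort, the alternative can be sketched in plain Java
as follows. The class and method names are hypothetical, and the keys
and values are passed as two parallel lists purely to keep the example
short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not Hadoop classes.
final class HashGroupSketch {

    // Group values by key with a hash table instead of sorting.
    // All groups are held in memory at once, which is the main cost here.
    static Map<String, List<Integer>> groupByKey(List<String> keys, List<Integer> values) {
        // LinkedHashMap keeps first-seen key order so the output is deterministic.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(keys.get(i), k -> new ArrayList<>()).add(values.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupByKey(
            Arrays.asList("b", "a", "b"), Arrays.asList(1, 2, 3)));
    }
}
```

The entries with the same key end up together without any comparison of
keys, but the whole table must fit in memory (or be spilled somehow),
whereas sorted runs can be merged from disk as a stream.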

If you don't need sorting, you can always implement map-side aggregation
and potentially set the # of reducers to 0.  There is no risk in doing
that, but if you want to aggregate results across different mappers, you
are back to the original problem.
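
A minimal plain-Java sketch of that trade-off, with hypothetical names
and a simple count as the aggregate: each mapper aggregates its own
input split in a local hash map, which needs no sort. In Hadoop a
map-only job is requested by setting the number of reduce tasks to 0
(for example with setNumReduceTasks(0) on the job); the sketch does not
use the Hadoop API and only simulates two mappers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not Hadoop classes.
final class MapSideAggregationSketch {

    // Count occurrences of each key within a single mapper's input split.
    static Map<String, Integer> aggregateInMapper(List<String> split) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : split) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two hypothetical input splits, each handled by its own mapper.
        Map<String, Integer> mapper1 = aggregateInMapper(Arrays.asList("a", "b", "a"));
        Map<String, Integer> mapper2 = aggregateInMapper(Arrays.asList("a", "c"));
        // Key "a" appears in both outputs; nothing has combined them across mappers.
        System.out.println(mapper1.get("a") + " and " + mapper2.get("a"));
    }
}
```

Each mapper's result is correct for its own split, but key "a" still has
two partial counts. Combining them requires bringing records with the
same key together across mappers, which is the grouping problem the sort
and merge were solving.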

Alex K
