You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Edward Capriolo <ed...@gmail.com> on 2010/01/08 20:51:36 UTC

Re: Will already sorted Mapper output improve speed of Sort in reducer?

Depends on the sorting algorithm, right? (thinking back to CS classes)
BIG O on bubble sort for already sorted data is great, big-0 on merge
sort for already sorted is close to worse case scenario ? (yes/no)

The input to each reducer is always sorted by key, hence the term
"shuffle SORT" by default you are shuffled based on your partitioner
HASH() and sorted for each reducer input.

Recently added is the InputSampler and TotalOrderPartitioner that
sorts keeps the entire results sorted.

On Fri, Jan 8, 2010 at 2:23 PM, Le Zhao <le...@cs.cmu.edu> wrote:
> Thanks Yongqiang.  That answered my question.
>
> Interesting.  I didn't know that the mapper output is sorted.  Is it the
> case that each map task's output is sorted? or that there can be multiple
> pieces if the map task has too much output?
>
> Thanks,
> Le
>
> Yongqiang He wrote:
>>
>> The mapper output is sorted using the quick-sort at the mapper side
>> (actually the sort algorithm can be pluggable). The reducer only needs to
>> use a merge sort in order to reduce number of files.
>>
>> Right now Hadoop always run a sorter at the mapper side to sort map
>> output.
>> One interesting point is to see how much time can be saved if the mapper's
>> input is sorted and output is also sorted naturally (this is true in most
>> situations if the operation at mapper is only sel, fil, or projection). In
>> this case, the mapper side sorting procedure is actually unneeded.
>>
>>
>> Thanks
>> yongqiang
>> On 1/7/10 7:21 PM, "Le Zhao" <le...@cs.cmu.edu> wrote:
>>
>>> Hi,
>>>
>>> Does anybody know whether sorted Mapper output will decrease the Sort in
>>> the reduce phase?
>>>
>>> I'm teaching a class, and am curious to know how much of a difference
>>> will sorted vs. unsorted mapper output be.  If the merge sort is
>>> implemented to deal with already sorted input, then I guess it will be
>>> fast.  Am I right?
>>>
>>> Thanks,
>>> Le
>>>
>>>
>>
>>
>>
>

Re: Will already sorted Mapper output improve speed of Sort in reducer?

Posted by Le Zhao <le...@cs.cmu.edu>.
Looks like either or both of us are misunderstanding,

Edward Capriolo wrote:
> Depends on the sorting algorithm, right? (thinking back to CS classes)
> BIG O on bubble sort for already sorted data is great, big-0 on merge
> sort for already sorted is close to worse case scenario ? (yes/no)
> 

It would be faster if the merge sort algorithm knows that each chunk of 
input is already sorted, only linear merging is needed.

Is it the case that in the current Sort, no assumption about the 
ordering of the input records is assumed?  If so, why sort the output of 
Map at the mapper side? (except for creating input for the combiner.)

> On Fri, Jan 8, 2010 at 2:23 PM, Le Zhao <le...@cs.cmu.edu> wrote:
>> Thanks Yongqiang.  That answered my question.
>>
>> Interesting.  I didn't know that the mapper output is sorted.  Is it the
>> case that each map task's output is sorted? or that there can be multiple
>> pieces if the map task has too much output?

Let me rephrase the question above: I know that it will be sorted during 
reduce, but will it be sorted immediately after map before shuffle?

If it is, then to what extent will the map output be sorted by the 
mapper?  Will each map output several lists of internally sorted 
records? or just one consistently sorted list of records?

Thanks,
Le