You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-user@hadoop.apache.org by Owen O'Malley <om...@apache.org> on 2010/06/01 05:11:44 UTC

Re: almost sorted map output

On Fri, May 28, 2010 at 10:58 PM, juber patel <ju...@gmail.com> wrote:
> Hello,
>
> Can Hadoop take advantage of the fact that the output of each map task
> is almost sorted?

The map output sort is efficient for mostly sorted output. The
dominant cost is the transfer costs of the shuffle and that won't be
helped by almost sorted output. There is an optimization that we have
discussed for years but not implemented where if you have lop-sided
partitions (such as largely sorted data and a total order
partitioner), you schedule the reduce close to the majority of its
data. In some applications that would be a large win.

> On a related note, Does Hadoop's Quicksort implementation give worst
> case performance on almost sorted data? Should I use Heapsort in its
> place?

The quick sort used on the map side is careful about pivots and the
resulting partitions. I wouldn't worry about replacing it.