You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by Devaraj Das <dd...@yahoo-inc.com> on 2006/11/15 06:00:53 UTC
Sort algorithms + interface

Here is a summary of some discussions on Sort algorithms (the implementation
work will be a part of Hadoop-331).

 

I did some benchmarking of the sort algorithms w.r.t our needs. Attached is
a table containing the results. I have benchmarked Hadoop's MergeSort,
QuickSort (a rather primitive one, without many optimizations, from
http://svn.apache.org/repos/asf/jakarta/turbine/core/trunk/src/java/org/apac
he/turbine/util/QuickSort.java

and java.util.Arrays.sort. I have also attached the source code for the
benchmark programs. The key/value pairs are IntWritables. The input to any
sort is an array of offsets to the beginning of key/value pairs in a buffer
(containing 'n' key/value pairs). The output of any sort is a sorted array
of offsets such that key_at_buffer[output[i]] < key_at_buffer[output[i+1]].

 

By the way, the java.util.Arrays.sort(Object[]) also uses MergeSort (note -
not QuickSort) & the only difference with Hadoop's MergeSort is that it uses
an array of OBJECTs as opposed to an array of INTs. It seems like the
overhead is higher in Java's Arrays.sort case both in terms of memory
footprint and the time it takes. Of

course, the input is the worst case input - 4 byte keys, and in reality we
will probably have bigger keys and thereby end up filling the
fixed-size-in-memory-map-buffer with much lesser number of keys.

 

In the attached table,

 

Time: in milliseconds, Memory: in bytes

 

Input: Same input for each run of a sort algo. For e.g., in the table below,
for 1000 records, the input that produced the1st result for MergeSort
(Time:26, Memory: 229008) also produced the 1st results for QuickSort (Time:
33 Memory: 229008) and Arrays.sort (Time: 27 Memory: 243016). Ran each algo
with 5 sample inputs for each category of #records. The records are
IntWritable key/value pairs.

 

 

Also, attached is an interface called SorterBase (will be a package private
interface). The way I see things is that we (hadoop developers) would have
an implementation of the interface, let's say, according to the design spec
on Hadoop-331 (in that the sort data structures are all arrays of ints).
Let's call the implementation SorterBaseImpl1. This would have the
implementation of all the methods of the interface except "sort" (which will
be an empty method). Classes MergeSort, QuickSort and HybridSort would
extend SorterBaseImpl1 and implement the sort() method. All these algorithms
would access the base class's datastructures and since these algorithms work
with very similar datastructures they can extend from one base class
-SorterBaseImpl1.

 

If we want to have Java based sorting (like java.util.Arrays.sort()), then

we need to implement the interface in such a way that all the datastructures

are created that way (like array of offset "objects" as opposed to int

arrays, etc.).

 

If we want to integrate a C sort, then we need to implement the interface as
such

(maybe it will be possible to reuse existing implementations like

SorterBaseImpl1; implement just the sort method differently, and have JNI

wrappers to get access to the Java datastructures within SorterBaseImpl1).
An interesting suggestion here is looking at how the C sort algorithms can
be used as it is in conjunction with streaming. 

 

Generally speaking, the expectation from any algorithm is that it can sort

an array of indirect pointers to a buffer. If an algorithm can do that, it

should be fairly easy to accommodate the algorithm in this framework.

 

Hadoop can have a configurable item called "map.sort.class" which is one of

MergeSort.class, QuickSort.class, etc. and it instantiates that class and

works with that (via methods defined in the interface).

 

The other sort algorithms that can be looked at are STL's sort,
http://www.sgi.com/tech/stl/sort.html

and optimized QuickSort (Hadoop-287).

 

Thanks to Eric, Doug, Sam, and Ben for the inputs.