You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by Devaraj Das <dd...@yahoo-inc.com> on 2006/11/15 06:00:53 UTC
Sort algorithms + interface
Here is a summary of some discussions on Sort algorithms (the implementation
work will be a part of Hadoop-331).
I did some benchmarking of the sort algorithms w.r.t our needs. Attached is
a table containing the results. I have benchmarked Hadoop's MergeSort,
QuickSort (a rather primitive one, without many optimizations, from
http://svn.apache.org/repos/asf/jakarta/turbine/core/trunk/src/java/org/apac
he/turbine/util/QuickSort.java
and java.util.Arrays.sort. I have also attached the source code for the
benchmark programs. The key/value pairs are IntWritables. The input to any
sort is an array of offsets to the beginning of key/value pairs in a buffer
(containing 'n' key/value pairs). The output of any sort is a sorted array
of offsets such that key_at_buffer[output[i]] < key_at_buffer[output[i+1]].
By the way, the java.util.Arrays.sort(Object[]) also uses MergeSort (note -
not QuickSort) & the only difference with Hadoop's MergeSort is that it uses
an array of OBJECTs as opposed to an array of INTs. It seems like the
overhead is higher in Java's Arrays.sort case both in terms of memory
footprint and the time it takes. Of
course, the input is the worst case input - 4 byte keys, and in reality we
will probably have bigger keys and thereby end up filling the
fixed-size-in-memory-map-buffer with much lesser number of keys.
In the attached table,
Time: in milliseconds, Memory: in bytes
Input: Same input for each run of a sort algo. For e.g., in the table below,
for 1000 records, the input that produced the1st result for MergeSort
(Time:26, Memory: 229008) also produced the 1st results for QuickSort (Time:
33 Memory: 229008) and Arrays.sort (Time: 27 Memory: 243016). Ran each algo
with 5 sample inputs for each category of #records. The records are
IntWritable key/value pairs.
Also, attached is an interface called SorterBase (will be a package private
interface). The way I see things is that we (hadoop developers) would have
an implementation of the interface, let's say, according to the design spec
on Hadoop-331 (in that the sort data structures are all arrays of ints).
Let's call the implementation SorterBaseImpl1. This would have the
implementation of all the methods of the interface except "sort" (which will
be an empty method). Classes MergeSort, QuickSort and HybridSort would
extend SorterBaseImpl1 and implement the sort() method. All these algorithms
would access the base class's datastructures and since these algorithms work
with very similar datastructures they can extend from one base class
-SorterBaseImpl1.
If we want to have Java based sorting (like java.util.Arrays.sort()), then
we need to implement the interface in such a way that all the datastructures
are created that way (like array of offset "objects" as opposed to int
arrays, etc.).
If we want to integrate a C sort, then we need to implement the interface as
such
(maybe it will be possible to reuse existing implementations like
SorterBaseImpl1; implement just the sort method differently, and have JNI
wrappers to get access to the Java datastructures within SorterBaseImpl1).
An interesting suggestion here is looking at how the C sort algorithms can
be used as it is in conjunction with streaming.
Generally speaking, the expectation from any algorithm is that it can sort
an array of indirect pointers to a buffer. If an algorithm can do that, it
should be fairly easy to accommodate the algorithm in this framework.
Hadoop can have a configurable item called "map.sort.class" which is one of
MergeSort.class, QuickSort.class, etc. and it instantiates that class and
works with that (via methods defined in the interface).
The other sort algorithms that can be looked at are STL's sort,
http://www.sgi.com/tech/stl/sort.html
and optimized QuickSort (Hadoop-287).
Thanks to Eric, Doug, Sam, and Ben for the inputs.