Posted to common-user@hadoop.apache.org by Savannah Beckett <sa...@yahoo.com> on 2010/12/26 06:05:52 UTC

How to do Secondary Sort on a String and a float?

I am writing a secondary sort that sorts on a String key and a float value.  I am 
following the example in 
mapred/src/examples/org/apache/hadoop/examples/SecondarySort.java in the Hadoop 
package, which sorts a pair of integers.  I did a lot of research online, but 
most of the examples I found still use the old API.  It seems that with the new 
API I have to implement the RawComparator interface, which means I need to write 
the byte-level compare function no matter what.  


I have a problem with this code:
  public static class FirstGroupingComparator
                implements RawComparator<IntPair> {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      return WritableComparator.compareBytes(b1, s1, Integer.SIZE/8,
                                             b2, s2, Integer.SIZE/8);
    }
    @Override
    public int compare(IntPair o1, IntPair o2) {
      int l = o1.getFirst();
      int r = o2.getFirst();
      return l == r ? 0 : (l < r ? -1 : 1);
    }
  }


How do I write the code inside the first compare function?  What should I pass 
as the lengths of the String and the float (a primitive type) in the 
compareBytes function?  Does anyone have an example for a pair of String and 
float?

Thanks.  Merry Christmas.

Re: How to do Secondary Sort on a String and a float?

Posted by Harsh J <qw...@gmail.com>.
Hi,

You can use WritableComparator for "Writable" serializations. Docs
here: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableComparator.html

The issue lies with how you're encoding your pair of <String, Float>.
If you know the size of each field (or write a length prefix or marker
byte between them), you can locate the bytes of just the field you
need (String or Float) and call the compareBytes function on that
range. "s1" and "s2" are the start offsets, and "l1" and "l2" are the
number of bytes to read from those offsets, within the byte[] arrays
passed in for the two "Writable" objects.
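
As a rough sketch of that offset arithmetic (not the SecondarySort
example's own code): assume a hypothetical layout where each key is
written as a 2-byte big-endian length, the UTF-8 bytes of the String,
and then the 4-byte float. The names RawPairCompare, serialize,
compareFirst and comparePair are made up for illustration, and the
sketch uses only the JDK so it runs without Hadoop. Inside a real
RawComparator the same arithmetic would go into
compare(byte[], int, int, byte[], int, int), with
WritableComparator.compareBytes in place of the local helper.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RawPairCompare {

    // Assumed layout (hypothetical): [2-byte length][UTF-8 bytes][4-byte float], big-endian.
    static byte[] serialize(String first, float second) {
        byte[] utf8 = first.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length + 4);
        buf.putShort((short) utf8.length);
        buf.put(utf8);
        buf.putFloat(second);
        return buf.array();
    }

    // Reads the unsigned 2-byte length prefix that starts at offset s.
    static int stringLength(byte[] b, int s) {
        return ((b[s] & 0xff) << 8) | (b[s + 1] & 0xff);
    }

    // Lexicographic comparison of unsigned bytes; local stand-in for compareBytes.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Grouping comparison: looks only at the String part.
    // l1 and l2 (total record lengths) are unused because the prefix gives the length.
    static int compareFirst(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int len1 = stringLength(b1, s1);
        int len2 = stringLength(b2, s2);
        return compareBytes(b1, s1 + 2, len1, b2, s2 + 2, len2);
    }

    // Sort comparison: String first, then the float that follows it.
    static int comparePair(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int c = compareFirst(b1, s1, l1, b2, s2, l2);
        if (c != 0) {
            return c;
        }
        float f1 = ByteBuffer.wrap(b1, s1 + 2 + stringLength(b1, s1), 4).getFloat();
        float f2 = ByteBuffer.wrap(b2, s2 + 2 + stringLength(b2, s2), 4).getFloat();
        return Float.compare(f1, f2);
    }

    public static void main(String[] args) {
        byte[] a = serialize("apple", 2.5f);
        byte[] b = serialize("apple", -1.0f);
        System.out.println(compareFirst(a, 0, a.length, b, 0, b.length));
        System.out.println(comparePair(a, 0, a.length, b, 0, b.length));
    }
}
```

Note that the float is decoded before it is compared: the raw bytes of
negative IEEE 754 floats do not sort in numeric order under an
unsigned byte comparison. If the String is written as a Hadoop Text
instead, the length prefix is, as far as I know, a variable-length int
rather than a fixed 2 bytes, so the offset has to be computed
accordingly.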

You can also, perhaps, de-serialize the whole byte stream (via your
Writable.readFields()) and then compare object-wise -- but this would
be slower, since byte-to-byte comparisons are faster, which is why
RawComparator exists.
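
For contrast, a minimal sketch of the de-serializing approach, again
using only the JDK and the same assumed layout (2-byte length, UTF-8
bytes, 4-byte float). DeserializingCompare and StringFloatPair are
hypothetical names, and deserialize() here stands in for what your
Writable.readFields() would do:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DeserializingCompare {

    // Hypothetical key type holding the two fields.
    static final class StringFloatPair implements Comparable<StringFloatPair> {
        final String first;
        final float second;

        StringFloatPair(String first, float second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public int compareTo(StringFloatPair other) {
            // String.compareTo orders by UTF-16 code units, which can differ
            // from raw UTF-8 byte order for some non-ASCII text.
            int c = first.compareTo(other.first);
            return c != 0 ? c : Float.compare(second, other.second);
        }
    }

    // Assumed layout (hypothetical): [2-byte length][UTF-8 bytes][4-byte float].
    static byte[] serialize(String first, float second) {
        byte[] utf8 = first.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length + 4);
        buf.putShort((short) utf8.length);
        buf.put(utf8);
        buf.putFloat(second);
        return buf.array();
    }

    // Rebuilds the object from l bytes starting at offset s.
    static StringFloatPair deserialize(byte[] b, int s, int l) {
        ByteBuffer in = ByteBuffer.wrap(b, s, l);
        int len = in.getShort() & 0xffff;
        byte[] utf8 = new byte[len];
        in.get(utf8);
        return new StringFloatPair(new String(utf8, StandardCharsets.UTF_8), in.getFloat());
    }

    // Allocates two objects and a byte[] copy per call, which is the cost
    // that the raw byte comparison avoids.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return deserialize(b1, s1, l1).compareTo(deserialize(b2, s2, l2));
    }

    public static void main(String[] args) {
        byte[] x = serialize("apple", 2.5f);
        byte[] y = serialize("apple", 3.5f);
        System.out.println(compare(x, 0, x.length, y, 0, y.length));
    }
}
```

The comparison logic is easier to read this way, but every call
rebuilds both keys, and the sort makes a very large number of such
calls.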

Avro has a neat serialization format; I prefer using it over plain
Writables. Working with a "Schema" is much easier.

-- 
Harsh J
www.harshj.com