Posted to common-user@hadoop.apache.org by Joel Welling <we...@psc.edu> on 2008/09/04 01:02:12 UTC

Trouble doing two-step sort

Hi folks;
  I'm trying to do a Hadoop streaming job which involves a two-step
sort, of the type described at
http://hadoop.apache.org/core/docs/r0.18.0/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29 .
I've got a Python script that emits records like:

020700414.0140. 140 12 132475

Over a whole bunch of such records I expect to find about a dozen with
each first number (020700414 in this case), with consecutive values of
the second number (0140 in this case).  I'm using IdentityReducer and
KeyFieldBasedPartitioner.  I'd like to partition the records based only
on the first number, so that all records with the same first number end
up in the same output file.  Then I'd like them sorted on the second
number, so that each output file contains an ordered run of records for
each first number it holds.
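
To make the goal concrete, here is a minimal pure-Python sketch of the
behavior I'm after.  It is only a simulation, not Hadoop code: the
function name is made up, and crc32 is just a stand-in for whatever hash
the real partitioner uses.

```python
import zlib
from collections import defaultdict


def partition_and_sort(records, num_partitions):
    """Simulate the desired outcome (hypothetical helper, not a Hadoop API).

    Assumes each record looks like '<first>.<second>. <rest>'.
    """
    partitions = defaultdict(list)
    for rec in records:
        first, second, _rest = rec.split(".", 2)
        # Choose the partition from the first number only.
        # crc32 is an illustrative stand-in for the real hash function.
        part = zlib.crc32(first.encode()) % num_partitions
        partitions[part].append(((first, second), rec))
    # Within each partition, order records by (first, second).
    # The second number is zero-padded, so string order matches numeric order.
    return {
        part: [rec for _key, rec in sorted(items)]
        for part, items in partitions.items()
    }
```

With this, every record sharing a first number lands in one partition,
and within that partition the second numbers come out in order.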

I've tried setting these values:

stream.map.output.field.separator=.
map.output.key.field.separator=.
stream.num.map.output.key.fields=2
num.key.fields.for.partition=1
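
For reference, this is how I understand those four settings to carve up
a record, going by the streaming docs.  It is a sketch of my reading
only, not Hadoop's actual implementation, and both helper names are
hypothetical.

```python
def split_key_value(line, sep=".", num_key_fields=2):
    """Treat everything before the num_key_fields-th separator as the key.

    Mirrors my reading of stream.map.output.field.separator and
    stream.num.map.output.key.fields (an assumption, not Hadoop source).
    """
    parts = line.split(sep, num_key_fields)
    if len(parts) <= num_key_fields:
        # Fewer separators than requested: the whole line is the key.
        return line, ""
    return sep.join(parts[:num_key_fields]), parts[num_key_fields]


def partition_prefix(key, sep=".", num_partition_fields=1):
    """Return the leading key fields that should decide the partition.

    Mirrors my reading of map.output.key.field.separator and
    num.key.fields.for.partition (again, an assumption).
    """
    return sep.join(key.split(sep)[:num_partition_fields])


key, value = split_key_value("020700414.0140. 140 12 132475")
print(repr(key), repr(value), repr(partition_prefix(key)))
```

Under that reading the key for the sample record is 020700414.0140, and
only 020700414 should feed the partitioner.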

but I end up with each first number evenly distributed over all the
output files!  Within each file, the records for a given first number
appear together and the second numbers are in the right order, but the
records for that first number are still split across files.  This isn't
the result I wanted.

Am I misunderstanding this example somehow?  What settings should give
me the expected output order?  

Thanks, I hope;
-Joel Welling
 welling@psc.edu