Posted to mapreduce-user@hadoop.apache.org by Keren Ouaknine <ke...@gmail.com> on 2014/08/28 23:09:30 UTC

RandomSampler and TotalOrderPartitioner

Hello,

I am running a global sort (on Pigmix input data, 600 GB in size) based
on TotalOrderPartitioner. According to the literature, the best practice is
to sample the data with RandomSampler. The query succeeds but takes a
very long time (7 hours), because only one reducer runs
(which defeats the point of using the above classes :) ).
I am trying to figure out what forces the number of reducers to be *one*,
even though I set it to *400*. Looking into the documentation and the code
of RandomSampler, I found a requirement which says:

// Set the path to the SequenceFile storing the sorted partition keyset.
// It must be the case that for R reduces, there are R-1 keys in the SequenceFile.
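
To make the "R reduces, R-1 keys" requirement concrete, here is a minimal plain-Java sketch of picking evenly spaced cut points from a sorted list of sampled keys. It is an illustration only, not Hadoop's code: the class name SplitPointsSketch, the method splitPoints and the sample keys are all hypothetical, and duplicate keys are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Hadoop.
class SplitPointsSketch {

    // Returns numReduces - 1 cut points taken at evenly spaced
    // positions of the already sorted sample keys.
    static List<String> splitPoints(List<String> sortedSamples, int numReduces) {
        if (numReduces < 1) {
            throw new IllegalArgumentException("numReduces must be >= 1");
        }
        if (numReduces > 1 && sortedSamples.isEmpty()) {
            throw new IllegalArgumentException("need at least one sample");
        }
        List<String> cuts = new ArrayList<>();
        float step = sortedSamples.size() / (float) numReduces;
        for (int i = 1; i < numReduces; i++) {
            // Clamp the index so it never runs past the last sample.
            int k = Math.min(Math.round(step * i), sortedSamples.size() - 1);
            cuts.add(sortedSamples.get(k));
        }
        return cuts;
    }

    public static void main(String[] args) {
        List<String> samples =
            Arrays.asList("a", "c", "e", "g", "i", "k", "m", "o");
        System.out.println(splitPoints(samples, 4)); // prints [e, i, m]
        System.out.println(splitPoints(samples, 1)); // prints []
    }
}
```

In this sketch the number of keys depends only on numReduces, not on how many samples were drawn: with a single reducer the list of cut points is empty, however many samples are available.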


And therefore I sampled as follows:

InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.9, 399, 444);

Looking into my _partition file, I can see there is only one partition, which
explains the single reducer:
SEQ org.apache.hadoop.io.Text!org.apache.hadoop.io.NullWritable

Why does the partition file contain only one sample, even though
I asked for 399 samples above?

Thanks for the help!!
Keren


-- 
Keren Ouaknine
www.kereno.com