Posted to common-user@hadoop.apache.org by Karl Wettin <ka...@gmail.com> on 2008/04/16 21:07:07 UTC

aborting reducer

I have a job that, out of a list of objects, finds the one with the
least distance to a given test object. All my reducer does is collect
the first result and ignore the rest.

 > private boolean processed = false;
 > public void reduce(DoubleWritable distance, Iterator<LongWritable> keys,
 >              OutputCollector<DoubleWritable, LongWritable> output,
 >              Reporter reporter)
 >    throws IOException {
 >   if (processed) {
 >     return;  // the smallest distance was already emitted
 >   }
 >   output.collect(distance, keys.next());
 >   processed = true;
 > }

I'm not sure if I am doing something fundamentally wrong in designing
the mapper and the reducer, or if I came up with a new use case, but it
feels very inefficient to iterate through all those records and
deserialize them just to ignore the value. I went looking in the code
base to see if it was possible to abort the reduction/combination
iteration and found that a simple enough solution would be to throw
some exception (or have reduce return a boolean).
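
As an illustration of the "have reduce return a boolean" idea, here is a
minimal plain-Java sketch. It does not use any Hadoop classes: the
AbortableReducer interface, the run() driver loop and the sample data
are all hypothetical and only simulate a framework that stops feeding
key groups once the reducer returns false.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these types are part of Hadoop.
final class AbortableReduceSketch {

  /** Hypothetical reducer contract: return false to stop the iteration. */
  interface AbortableReducer<K, V> {
    boolean reduce(K key, Iterator<V> values, List<String> output);
  }

  /** Feeds key groups in sorted key order; returns how many groups were visited. */
  static <K, V> int run(TreeMap<K, List<V>> groups,
                        AbortableReducer<K, V> reducer,
                        List<String> output) {
    int visited = 0;
    for (Map.Entry<K, List<V>> e : groups.entrySet()) {
      visited++;
      if (!reducer.reduce(e.getKey(), e.getValue().iterator(), output)) {
        break; // reducer asked to abort: remaining groups are never touched
      }
    }
    return visited;
  }

  /** Runs the sketch on three made-up (distance, ids) groups. */
  static int demo(List<String> output) {
    TreeMap<Double, List<Long>> groups = new TreeMap<>();
    groups.put(2.5, Arrays.asList(7L));
    groups.put(0.4, Arrays.asList(3L, 9L));
    groups.put(1.1, Arrays.asList(5L));
    // Emit the first id of the smallest distance, then abort.
    return run(groups, (distance, ids, out) -> {
      out.add(distance + "\t" + ids.next());
      return false;
    }, output);
  }

  public static void main(String[] args) {
    List<String> output = new ArrayList<>();
    int visited = demo(output);
    System.out.println(output + " after " + visited + " group(s)");
  }
}
```

Because the keys are visited in sorted order, the first group is the one
with the smallest distance, so the loop ends after a single reduce call
and the other groups are never iterated.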


     karl

Re: aborting reducer

Posted by Ted Dunning <td...@veoh.com>.
Would it be better to have lots of records arrive at the same reducer?

That has a simpler mechanism for ignoring data.

You can just add a (trivial) partition function in addition to your sort.
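
One possible reading of this suggestion, sketched in plain Java rather
than against the Hadoop API: a trivial partition function sends every
record to partition 0, the records are sorted by distance, and the
reducer only needs the first one. The Candidate class, the method names
and the sample data are hypothetical; the in-memory lists merely stand in
for the framework's shuffle and sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation; no Hadoop classes are used and all names are hypothetical.
final class SinglePartitionSketch {

  /** A made-up (distance, id) pair as a mapper might emit it. */
  static final class Candidate {
    final double distance;
    final long id;
    Candidate(double distance, long id) {
      this.distance = distance;
      this.id = id;
    }
  }

  /** Trivial partition function: every record goes to partition 0. */
  static int partition(Candidate candidate, int numPartitions) {
    return 0;
  }

  /** Routes records, sorts each partition by distance, returns the head of partition 0. */
  static Candidate nearest(List<Candidate> records, int numPartitions) {
    List<List<Candidate>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    for (Candidate c : records) {
      partitions.get(partition(c, numPartitions)).add(c);
    }
    for (List<Candidate> p : partitions) {
      // Stands in for the sort the framework performs before the reduce phase.
      p.sort(Comparator.comparingDouble((Candidate c) -> c.distance));
    }
    List<Candidate> only = partitions.get(0);
    // The "reducer" reads the first record and ignores everything after it.
    return only.isEmpty() ? null : only.get(0);
  }

  /** Runs the sketch on three made-up candidates and four partitions. */
  static Candidate demo() {
    List<Candidate> records = Arrays.asList(
        new Candidate(2.5, 7L),
        new Candidate(0.4, 3L),
        new Candidate(1.1, 5L));
    return nearest(records, 4);
  }

  public static void main(String[] args) {
    Candidate best = demo();
    System.out.println(best.id + " at distance " + best.distance);
  }
}
```

In a real job the same routing would presumably be expressed by
implementing the Partitioner interface's getPartition(key, value,
numPartitions) method, or simply by running the job with a single reduce
task.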

