Posted to common-user@hadoop.apache.org by David Hawthorne <dh...@3crowd.com> on 2010/07/15 22:26:12 UTC

how to do a reduce-only job

I have two previously created output files of format:

key[tab]value

where key is text, value is an integer sum of how many times the key appeared.

I would like to reduce these output files together into one new output file.  I'm having problems finding out how to do this.

I've found ways to specify a job with no reducers, but it doesn't look like there's a way to specify a reduce-only job, aside from using the streaming interface with 'cat' as the mapper.  I'm not opposed to this, but I also couldn't find a way to specify 'cat' as the mapper and the reducer in my java class as the reducer.  I'm also not sure this would work, as the reducer might simply see the entire line emitted by cat as the key.  I could use awk as the reducer, but I've heard that streaming is slower than java, and I've already got the java class written.  I could write another java class with a mapper that splits each line on the tab and emits the two fields as <key, value>, but that seems like extra work and less optimal than being able to run a reduce-only job.
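[For reference, the tab-splitting logic such a mapper would need is tiny.  Here is a minimal plain-Java sketch of just that step, outside of any Hadoop API; the class and method names are illustrative, not Hadoop's:]

```java
public class TabSplit {
    // Split one "key[tab]value" output line back into its two fields.
    static String[] splitOnTab(String line) {
        int tab = line.indexOf('\t');
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitOnTab("foo\t42");
        System.out.println(kv[0] + " -> " + kv[1]); // prints "foo -> 42"
    }
}
```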

So... what are the options?  Is there a way to specify a reduce-only job?

Re: how to do a reduce-only job

Posted by Asif Jan <As...@unige.ch>.
you need to join these files into 1; you could either do a map-side  
join or a reduce-side join

for a map-side join (slightly more involved) look at the example:

org.apache.hadoop.examples.Join

for a reduce-side join, simply create 2 mappers (one for each file) and  
one reducer (as long as you keep the key-value types the same for both)
You will have to use multiple input formats for doing so.

e.g.
MultipleInputs.addInputPath(conf, path1, input_format1, mapper_class1)
MultipleInputs.addInputPath(conf, path2, input_format2, mapper_class2)

The javadoc of the class explains it further.
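[To make the shape of the reduce-side join concrete, here is a small plain-Java simulation of what happens after the two mappers have emitted their (key, count) pairs: the shuffle groups by key across both inputs, and a single reducer sums each group.  This only illustrates the data flow; it is not Hadoop API code, and all names are made up for the example:]

```java
import java.util.*;

public class ReduceSideJoinSketch {
    // Shuffle + reduce: group (key, count) pairs from both mapped inputs
    // by key, then sum each group -- the work the single reducer does.
    static Map<String, Integer> mergeCounts(List<String[]> fileA,
                                            List<String[]> fileB) {
        Map<String, Integer> sums = new TreeMap<>();
        for (List<String[]> input : List.of(fileA, fileB)) {
            for (String[] kv : input) {
                sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String[]> a = List.of(new String[]{"foo", "2"},
                                   new String[]{"bar", "1"});
        List<String[]> b = List.of(new String[]{"foo", "3"});
        System.out.println(mergeCounts(a, b)); // prints "{bar=1, foo=5}"
    }
}
```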

cheers
