Posted to common-user@hadoop.apache.org by Vadim Zaliva <kr...@gmail.com> on 2008/01/15 22:17:58 UTC

single output file

Hi!

I have a novice question. I have data consisting of (Text, Long)
tuples, and I need to calculate the sum of the values. The way I am
achieving it now is by mapping every Text key to the constant key
Text("Total") and using LongSumReducer as both Combiner and Reducer.
It seems to be working, except that I get many 0-byte output files
and just one non-empty file with the actual result. Is there a way
to avoid the creation of these empty files? Thanks!

Vadim
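
[Editor's note: for context, the setup described above looks roughly
like this in the old org.apache.hadoop.mapred API. This is a minimal
sketch; the class names, the SequenceFile input assumption, and the
job wiring are illustrative, not taken from the original post.]

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.LongSumReducer;

// Map every (Text, Long) record to the constant key "Total" so that
// LongSumReducer adds up all the values under a single key.
public class TotalSum {

  public static class ConstantKeyMapper extends MapReduceBase
      implements Mapper<Text, LongWritable, Text, LongWritable> {
    private static final Text TOTAL = new Text("Total");

    public void map(Text key, LongWritable value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      output.collect(TOTAL, value);  // discard the original key
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TotalSum.class);
    conf.setJobName("total-sum");
    // Assumes the input is a SequenceFile of (Text, LongWritable).
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(ConstantKeyMapper.class);
    conf.setCombinerClass(LongSumReducer.class);
    conf.setReducerClass(LongSumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}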


Re: single output file

Posted by Vadim Zaliva <kr...@gmail.com>.
On Jan 15, 2008, at 13:57, Ted Dunning wrote:

> This is happening because you have many reducers running, only one
> of which gets any data.
>
> Since you have combiners, this probably isn't a problem.  That
> reducer should only get as many records as you have maps.  It would
> be a problem if your reducer were getting lots of input records.
>
> You can avoid this by setting the number of reducers to 1.

Thanks!

I also have another, perhaps stupid, question. I am trying to write
a task which will produce a list of the records with the top N
values. My idea is to write a reducer class which iterates through
the records, keeps the N with the largest values, and emits them at
the end. I can use it as both the combiner and the reducer class.
This way each map task will produce N records, and I will set up a
single reduce task which combines them into the final N records (N
is reasonably small, like 10). However, to do this I need to
postpone issuing output until I am done processing all the records.
I could try to do this in the close() method, but I do not have an
OutputCollector there. I suppose I could write a special output
collector, but that seems a bit artificial.

Probably I am missing something obvious; is there a common and easy
way to do this?

Thanks!

Sincerely,
Vadim
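
[Editor's note: the close() workaround Vadim describes is a known
pattern with the old API: cache the OutputCollector passed to
reduce() in a field, then emit from close(). A minimal sketch
follows; TopNReducer, Pair, and the hard-coded N = 10 are all
illustrative, not Hadoop API. Whether the same trick is safe in a
combiner depends on how the framework instantiates and closes
combiners, so treat this as the reducer-side sketch.]

import java.io.IOException;
import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Keeps the N records with the largest values; emits them in close().
public class TopNReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  private static final int N = 10;  // illustrative; could come from JobConf

  private static class Pair {
    final String key;
    final long value;
    Pair(String key, long value) { this.key = key; this.value = value; }
  }

  // Min-heap ordered by value, so the smallest kept record is always
  // at the head and cheap to evict.
  private final PriorityQueue<Pair> top =
      new PriorityQueue<Pair>(N, new Comparator<Pair>() {
        public int compare(Pair a, Pair b) {
          return a.value < b.value ? -1 : (a.value > b.value ? 1 : 0);
        }
      });

  // Remember the collector handed to reduce() so close() can use it.
  private OutputCollector<Text, LongWritable> out;

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    out = output;
    while (values.hasNext()) {
      top.add(new Pair(key.toString(), values.next().get()));
      if (top.size() > N) {
        top.poll();  // evict the current smallest
      }
    }
  }

  public void close() throws IOException {
    if (out != null) {
      // Heap iteration order is arbitrary; the single final reduce
      // task sees all candidates, so order does not matter here.
      for (Pair p : top) {
        out.collect(new Text(p.key), new LongWritable(p.value));
      }
    }
  }
}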

Re: single output file

Posted by Ted Dunning <td...@veoh.com>.

This is happening because you have many reducers running, only one of which
gets any data.

Since you have combiners, this probably isn't a problem.  That reducer
should only get as many records as you have maps.  It would be a problem if
your reducer were getting lots of input records.

You can avoid this by setting the number of reducers to 1.
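
[Editor's note: with the classic JobConf-based API that is a one-line
job setting; conf here is the JobConf from the job setup, so this is
a fragment rather than a complete program.]

// Force a single reduce task so the job writes exactly one
// non-empty output file (part-00000).
conf.setNumReduceTasks(1);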

