Posted to common-user@hadoop.apache.org by Dieter Plaetinck <di...@intec.ugent.be> on 2011/03/29 16:55:46 UTC

# of keys per reducer invocation (streaming api)

Hi, I'm using the streaming API and I notice that my reducer gets, in the
same invocation, a bunch of different keys, and I wonder why. I would
expect to get one key per reducer invocation, as with the "normal" Java
Hadoop API.
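
For example (made-up keys; the mapper output is tab-separated), a single
reducer invocation sees something like this on stdin, several keys in a
row:

    apple   1
    apple   1
    banana  1
    cherry  1
    cherry  1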

Is this done to limit the number of spawned processes, on the assumption
that creating and destroying a process is usually expensive compared to
the amount of work it has to do (not much, if you have many keys with
only a handful of values each)?

OTOH, if you have a large number of values spread over a small number of
keys, I would rather stick to one key per reducer invocation, so I don't
need to worry about supporting (and allocating memory for) multiple input
keys. Is there a config setting to enable such behavior?

Maybe I'm missing something, but this seems like a big difference
compared to the default way of working, and should perhaps be added to
the FAQ at
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

thanks,
Dieter

Re: # of keys per reducer invocation (streaming api)

Posted by Dieter Plaetinck <di...@intec.ugent.be>.
On Tue, 29 Mar 2011 23:17:13 +0530
Harsh J <qw...@gmail.com> wrote:

> Hello,
> 
> On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
> <di...@intec.ugent.be> wrote:
> > Hi, I'm using the streaming API and I notice that my reducer gets,
> > in the same invocation, a bunch of different keys, and I wonder why.
> > I would expect to get one key per reducer invocation, as with the
> > "normal" Java Hadoop API.
> >
> > Is this done to limit the number of spawned processes, on the
> > assumption that creating and destroying a process is usually
> > expensive compared to the amount of work it has to do (not much, if
> > you have many keys with only a handful of values each)?
> >
> > OTOH, if you have a large number of values spread over a small
> > number of keys, I would rather stick to one key per reducer
> > invocation, so I don't need to worry about supporting (and
> > allocating memory for) multiple input keys. Is there a config
> > setting to enable such behavior?
> >
> > Maybe I'm missing something, but this seems like a big difference
> > compared to the default way of working, and should perhaps be added
> > to the FAQ at
> > http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
> >
> > thanks,
> > Dieter
> >
> 
> I think it makes more sense to think of streaming programs as complete
> map/reduce 'tasks', rather than as implementations of the map() and
> reduce() functions themselves. Both programs have to be written from
> the input-reading level onwards: the map program reads one record per
> input line, while the reduce program reads sorted key/value lines in
> which all values for the same key arrive consecutively. You have to
> handle the reading loop, including detecting where one key ends and
> the next begins, yourself.
> 
> Some non-Java libraries that provide abstractions atop the streaming
> (and similar) layers allow for more fluent representations of the
> map() and reduce() functions, hiding these finer details away (much as
> the Java API does). Dumbo[1] is such a library for Python Hadoop
> Map/Reduce programs, for example.
> 
> A FAQ entry on this would be good too! You can file a ticket to have
> this observation added to the streaming docs' FAQ.
> 
> [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial
> 

Thanks,
this makes it a little clearer.
I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410

Dieter

Re: # of keys per reducer invocation (streaming api)

Posted by Harsh J <qw...@gmail.com>.
Hello,

On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
<di...@intec.ugent.be> wrote:
> Hi, I'm using the streaming API and I notice that my reducer gets, in the
> same invocation, a bunch of different keys, and I wonder why. I would
> expect to get one key per reducer invocation, as with the "normal" Java
> Hadoop API.
>
> Is this done to limit the number of spawned processes, on the assumption
> that creating and destroying a process is usually expensive compared to
> the amount of work it has to do (not much, if you have many keys with
> only a handful of values each)?
>
> OTOH, if you have a large number of values spread over a small number of
> keys, I would rather stick to one key per reducer invocation, so I don't
> need to worry about supporting (and allocating memory for) multiple input
> keys. Is there a config setting to enable such behavior?
>
> Maybe I'm missing something, but this seems like a big difference
> compared to the default way of working, and should perhaps be added to
> the FAQ at
> http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
>
> thanks,
> Dieter
>

I think it makes more sense to think of streaming programs as complete
map/reduce 'tasks', rather than as implementations of the map() and
reduce() functions themselves. Both programs have to be written from the
input-reading level onwards: the map program reads one record per input
line, while the reduce program reads sorted key/value lines in which all
values for the same key arrive consecutively. You have to handle the
reading loop, including detecting where one key ends and the next begins,
yourself.
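
To make that concrete, here is a minimal reducer sketch in Python (my own
illustration, not from the streaming docs), assuming the default
tab-separated "key<TAB>value" lines on stdin:

    #!/usr/bin/env python
    # Minimal streaming reducer sketch: the framework hands this one
    # process *all* keys of its partition as sorted "key<TAB>value"
    # lines, so the script detects key boundaries itself.
    import sys
    from itertools import groupby

    def parse(line):
        key, sep, value = line.rstrip("\n").partition("\t")
        return key, value

    for key, group in groupby((parse(line) for line in sys.stdin),
                              key=lambda kv: kv[0]):
        # Example aggregation: count how many values this key had.
        count = sum(1 for _ in group)
        print("%s\t%d" % (key, count))

You can simulate locally what the framework does with something like:
cat input.txt | ./mapper.py | sort | ./reducer.py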

Some non-Java libraries that provide abstractions atop the streaming
(and similar) layers allow for more fluent representations of the map()
and reduce() functions, hiding these finer details away (much as the
Java API does). Dumbo[1] is such a library for Python Hadoop Map/Reduce
programs, for example.
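
For comparison, a Dumbo reducer is written against the per-key
abstraction. Roughly (adapted from the short tutorial linked below; treat
the exact API as approximate), a word count looks like:

    def mapper(key, value):
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # 'values' here really is just the values for this one key;
        # Dumbo does the grouping for you.
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)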

A FAQ entry on this would be good too! You can file a ticket to have
this observation added to the streaming docs' FAQ.

[1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial

-- 
Harsh J
http://harshj.com