Posted to user@couchdb.apache.org by James Marca <jm...@translab.its.uci.edu> on 2009/02/10 23:31:58 UTC

how do I do different reduce operations on the same map

I have a situation where I want to run two different reduce functions
on the output of a single map function.  Like suppose I want one
reduce function to get the count of all objects in each group (for
example, documents with or without attachments), and another reduce to
compute some other aggregate, like the average and standard deviation
of a value, (like the average size of attached documents).  (Yes, I
know this is a stupid example, as the averaging reduce function will
also have the count, but my real case is too complicated to write
easily).

Should one strive for a minimal set of reduce functions per map (one
reduce for all three count, average, std deviation), or does it make
sense to identically copy the maps and make multiple reduce functions
(one reduce _each_ for count, mean, std dev)?  (again, ignore the fact
that you compute  count and mean when computing std dev, etc)

I have a feeling from reading the various docs that identical map
functions are only executed once in CouchDB.  If that is true, then is
it _also_ true that having lots of reduce functions for one map is not
any more expensive (in terms of space and computational speed) than
trying for a minimal set of map-reduce pairs?  Any advice on this?

Thanks in advance, 
James








Re: how do I do different reduce operations on the same map

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Feb 10, 2009 at 5:31 PM, James Marca
<jm...@translab.its.uci.edu> wrote:
> I have a situation where I want to run two different reduce functions
> on the output of a single map function.  Like suppose I want one
> reduce function to get the count of all objects in each group (for
> example, documents with or without attachments), and another reduce to
> compute some other aggregate, like the average and standard deviation
> of a value, (like the average size of attached documents).  (Yes, I
> know this is a stupid example, as the averaging reduce function will
> also have the count, but my real case is too complicated to write
> easily).
>
> Should one strive for a minimal set of reduce functions per map (one
> reduce for all three count, average, std deviation), or does it make
> sense to identically copy the maps and make multiple reduce functions
> (one reduce _each_ for count, mean, std dev)?  (again, ignore the fact
> that you compute  count and mean when computing std dev, etc)
>
> I have a feeling from reading the various docs that identical map
> functions are only executed once in CouchDB.  If that is true, then is
> it _also_ true that having lots of reduce functions for one map is not
> any more expensive (in terms of space and computational speed) than
> trying for a minimal set of map-reduce pairs?  Any advice on this?
>
> Thanks in advance,
> James
>

Your reading of the docs is spot on. If you have byte-identical map
functions, only a single btree is used for both views. At the moment,
the only way to reuse a single btree with multiple reduce functions is
to do exactly what you suggested: copy your maps and attach your
reduce functions as needed.
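
Something like the following (untested; the "size" field and the view
names are made up) is what I mean. The two map strings are byte
identical, so CouchDB builds only one btree for both views:

  {
    "_id": "_design/attachments",
    "views": {
      "count": {
        "map": "function(doc) { if (doc._attachments) emit(doc.type, doc.size); }",
        "reduce": "function(keys, values, rereduce) { return rereduce ? sum(values) : values.length; }"
      },
      "stats": {
        "map": "function(doc) { if (doc._attachments) emit(doc.type, doc.size); }",
        "reduce": "function(keys, values, rereduce) { /* average, std dev, etc. */ }"
      }
    }
  }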

Before I go on, I should mention that the best way to figure this out
would be to setup a couple benchmarks and measure if there's any
noticeable difference between having multiple reduce functions vs. one
complex one.

That said, each reduce function adds a round trip through the view
server every time a reduce is called. I would cautiously lean towards
thinking that this isn't as much overhead as you might expect; that is,
view generation is more likely to be dominated by other things.

The space requirement should be roughly proportional to the output that
either method produces; that is, multiple reduce functions aren't in
and of themselves going to cause you problems. The only extra overhead
I can think of is a slightly different Erlang term format being
serialized in each case.

HTH,
Paul Davis

Re: how do I do different reduce operations on the same map

Posted by James Marca <jm...@translab.its.uci.edu>.
On Wed, Feb 11, 2009 at 08:08:39PM +0000, Brian Candler wrote:
> On Tue, Feb 10, 2009 at 02:31:58PM -0800, James Marca wrote:
> > I have a situation where I want to run two different reduce functions
> > on the output of a single map function.  Like suppose I want one
> > reduce function to get the count of all objects in each group (for
> > example, documents with or without attachments), and another reduce to
> > compute some other aggregate, like the average and standard deviation
> > of a value, (like the average size of attached documents).  (Yes, I
> > know this is a stupid example, as the averaging reduce function will
> > also have the count, but my real case is too complicated to write
> > easily).
> 
> I believe reduce values can be any JSON value, so perhaps you could
> reduce to an array of values, e.g. [count, total, sum_of_squares].
> 
> The final calculation of the average and SD could then be left to the
> client.

I'll have to think about what that means.  I've got mean/sd, etc.
handled in a reduce, but I was wondering about doing other things with
the same map.  I am analyzing a year of detector data, with most of
the detectors reporting every 30 seconds.  So I want to say things
like "the average, std. dev., min, and max for X on Tuesday between
8:05 and 8:10 were [...]".  That's one map/reduce run.  But there
might be other things we want to look at, so I was wondering whether
it was worth optimizing a single map now (given the size of the data)
rather than adding more maps later.
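
(For what it's worth, the map I have in mind keys on the time
components, roughly like this -- the field names are made up and I'm
assuming the timestamp parses cleanly with new Date():

  function(doc) {
    var d = new Date(doc.timestamp);
    // key sorts by detector, then year/month/day/hour/minute
    emit([doc.detector_id, d.getFullYear(), d.getMonth() + 1,
          d.getDate(), d.getHours(), d.getMinutes()],
         doc.volume);
  }

Then a reduce query with group_level set at the minute, hour, or day
boundary should give the aggregates for that window from the same
index.)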

James






Re: how do I do different reduce operations on the same map

Posted by Brian Candler <B....@pobox.com>.
On Tue, Feb 10, 2009 at 02:31:58PM -0800, James Marca wrote:
> I have a situation where I want to run two different reduce functions
> on the output of a single map function.  Like suppose I want one
> reduce function to get the count of all objects in each group (for
> example, documents with or without attachments), and another reduce to
> compute some other aggregate, like the average and standard deviation
> of a value, (like the average size of attached documents).  (Yes, I
> know this is a stupid example, as the averaging reduce function will
> also have the count, but my real case is too complicated to write
> easily).

I believe reduce values can be any JSON value, so perhaps you could
reduce to an array of values, e.g. [count, total, sum_of_squares].

The final calculation of the average and SD could then be left to the
client.
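
As an untested sketch (assuming the map emits a plain number as the
value), the reduce could look like:

  function(keys, values, rereduce) {
    var count = 0, total = 0, sumsq = 0, i;
    if (rereduce) {
      // values are [count, total, sumsq] triples from lower tree levels
      for (i = 0; i < values.length; i++) {
        count += values[i][0];
        total += values[i][1];
        sumsq += values[i][2];
      }
    } else {
      // values are the raw numbers emitted by the map
      for (i = 0; i < values.length; i++) {
        count += 1;
        total += values[i];
        sumsq += values[i] * values[i];
      }
    }
    return [count, total, sumsq];
  }

The client would then compute mean = total/count and
sd = sqrt(sumsq/count - mean*mean).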