Posted to user@couchdb.apache.org by Michael Stillwell <mj...@beebo.org> on 2009/05/26 13:21:10 UTC

Re: Reduce limitations

On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler <B....@pobox.com> wrote:
> Has anyone found a clear definition of the limitations of reduce functions,
> in terms of how large the reduced data may be?

In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
Damian says:

"the size of the reduce value can be logarithmic with respect to the rows"



--M.

-- 
http://beebo.org

Re: Reduce limitations

Posted by Zachary Zolton <za...@gmail.com>.
I think it's worth putting in writing (perhaps on the wiki?) that
reduce functions are useful for *computations* over your data, not for
emulating SQL-style joins!

This was a hard lesson for me: I had to rework my code considerably to
do more of that work on the client side, because of the performance
problems I ran into in a production environment.

Querying a reduce view with group=true causes considerable work, since
the reduce steps have to be re-run for each group at query time. Keep
your view code simple, especially if you're on EC2 small instances!
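
To make the shape of such a query concrete, here is a minimal sketch; the
database, design document, and view names ("mydb", "app", "tags") are made
up for illustration, and it assumes an environment with a built-in fetch()
(Node 18+ or a browser):

    // Hypothetical names; only the ?group=true query string matters here.
    async function tagCounts() {
      const res = await fetch(
        "http://localhost:5984/mydb/_design/app/_view/tags?group=true"
      );
      const { rows } = await res.json();
      // One row per distinct key, e.g. [{ key: "couchdb", value: 42 }, ...];
      // producing each of these rows is the per-query reduce work noted above.
      return rows;
    }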

Cheers,
Zach

On Tue, Jun 2, 2009 at 4:04 AM, Brian Candler <B....@pobox.com> wrote:
> On Thu, May 28, 2009 at 12:54:16PM -0700, Chris Anderson wrote:
>> The deal is that if your reduce function's output is the same size as
>> its input, the final reduce value will end up being as large as all
>> the map rows put together.
>>
>> If your reduce function's output is 1/2 the size of its input, you'll
>> also end up with quite a large amount of data in the final reduce. In
>> these cases each reduction stage actually accumulates more data, as it
>> is based on ever increasing numbers of map rows.
>>
>> If the function reduces data fast enough, the intermediate reduction
>> values will stay relatively constant, even as each reduce stage
>> reflects logarithmically more map rows. This is the kind of reduce
>> function you want.
>
> So actually, the requirement is that the final (root) result of the reduce
> process should be of a moderate size, and so should all the intermediate
> reduce values which comprise it. That makes sense.
>
> Depending on the type of reduce function you use, the "growth" may or may
> not be related to the number of documents which have been reduced together
> to form the reduce value.
>
> For example: a reduce function which returns a map of {tag: count}, where
> the number of unique tags is bounded, may return a fairly large object when
> reduced across a small number of docs, but the final root reduce value is no
> larger. So all you need to do is keep the number of tags within a
> 'reasonable' range (e.g. tens rather than thousands).
>
> Regards,
>
> Brian.
>

Re: Reduce limitations

Posted by Brian Candler <B....@pobox.com>.
On Thu, May 28, 2009 at 12:54:16PM -0700, Chris Anderson wrote:
> The deal is that if your reduce function's output is the same size as
> its input, the final reduce value will end up being as large as all
> the map rows put together.
> 
> If your reduce function's output is 1/2 the size of its input, you'll
> also end up with quite a large amount of data in the final reduce. In
> these cases each reduction stage actually accumulates more data, as it
> is based on ever increasing numbers of map rows.
> 
> If the function reduces data fast enough, the intermediate reduction
> values will stay relatively constant, even as each reduce stage
> reflects logarithmically more map rows. This is the kind of reduce
> function you want.

So actually, the requirement is that the final (root) result of the reduce
process should be of a moderate size, and so should all the intermediate
reduce values which comprise it. That makes sense.

Depending on the type of reduce function you use, the "growth" may or may
not be related to the number of documents which have been reduced together
to form the reduce value.

For example: a reduce function which returns a map of {tag: count}, where
the number of unique tags is bounded, may return a fairly large object when
reduced across a small number of docs, but the final root reduce value is no
larger. So all you need to do is keep the number of tags within a
'reasonable' range (e.g. tens rather than thousands).
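
A minimal sketch of that kind of map/reduce pair (not taken from any real
design document; the doc.tags field and the emitted values are assumptions):

    // map: emit one row per tag on each document
    function (doc) {
      (doc.tags || []).forEach(function (tag) {
        emit(tag, 1);
      });
    }

    // reduce: merge everything into a single {tag: count} object. Its size is
    // bounded by the number of distinct tags, not by the number of rows
    // reduced, so it stays 'reasonable' as long as the tag set does.
    function (keys, values, rereduce) {
      var counts = {};
      if (rereduce) {
        // values are {tag: count} objects from earlier reduce passes
        values.forEach(function (v) {
          for (var tag in v) counts[tag] = (counts[tag] || 0) + v[tag];
        });
      } else {
        // keys are [emitted_key, doc_id] pairs; the emitted key is the tag
        keys.forEach(function (k, i) {
          counts[k[0]] = (counts[k[0]] || 0) + values[i];
        });
      }
      return counts;
    }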

Regards,

Brian.

Re: Reduce limitations

Posted by Chris Anderson <jc...@apache.org>.
On Thu, May 28, 2009 at 12:37 PM, Brian Candler <B....@pobox.com> wrote:
> On Tue, May 26, 2009 at 12:21:10PM +0100, Michael Stillwell wrote:
>> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler <B....@pobox.com> wrote:
>> > Has anyone found a clear definition of the limitations of reduce functions,
>> > in terms of how large the reduced data may be?
>>
>> In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
>> Damian says:
>>
>> "the size of the reduce value can be logarithmic with respect to the rows"
>
> Which doesn't give any guidance as to the absolute maximum size of the
> reduce value, only how big it can get in relation to the number of rows,
> e.g.
>
>   max_reduce_object_size = k * ln(number of rows reduced)
>
> for some unknown k. Or is it ln(total size of rows reduced)?
>

The deal is that if your reduce function's output is the same size as
its input, the final reduce value will end up being as large as all
the map rows put together.

If your reduce function's output is 1/2 the size of its input, you'll
also end up with quite a large amount of data in the final reduce. In
these cases each reduction stage actually accumulates more data, as it
is based on ever increasing numbers of map rows.

If the function reduces data fast enough, the intermediate reduction
values will stay relatively constant, even as each reduce stage
reflects logarithmically more map rows. This is the kind of reduce
function you want.
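
A minimal sketch of both kinds, assuming the map step emits a number per
row (sum() is the helper the JavaScript view server provides):

    // Well-behaved: the output is a single number at every level, no matter
    // how many map rows or intermediate reductions feed into it.
    function (keys, values, rereduce) {
      return sum(values);
    }

    // Badly behaved (the "same size as its input" case): it returns its
    // inputs unreduced, so the stored value grows with the number of rows
    // beneath it and nests further on each rereduce.
    function (keys, values, rereduce) {
      return values;
    }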

Theoretically there are no hard limits: even the first kind of function
(identity on values, which leads to linear growth of intermediate values)
could eventually complete, even on a large data set.

Practically, the first limit you'll hit is that the input values for the
function will no longer fit in the JavaScript interpreter's memory space.
Even if that were not an issue, the function's computation time will keep
going up; similarly there will be slowdowns in index processing, as the
reduction values are stored in the btree inner nodes. Shuffling around
multi-gigabyte inner nodes is not optimal and should be avoided.

I hope that's clear; let me know if I can make it clearer.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Reduce limitations

Posted by Brian Candler <B....@pobox.com>.
On Tue, May 26, 2009 at 12:21:10PM +0100, Michael Stillwell wrote:
> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler <B....@pobox.com> wrote:
> > Has anyone found a clear definition of the limitations of reduce functions,
> > in terms of how large the reduced data may be?
> 
> In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
> Damian says:
> 
> "the size of the reduce value can be logarithmic with respect to the rows"

Which doesn't give any guidance as to the absolute maximum size of the
reduce value, only how big it can get in relation to the number of rows,
e.g.

   max_reduce_object_size = k * ln(number of rows reduced)

for some unknown k. Or is it ln(total size of rows reduced)?

Re: Reduce limitations

Posted by Jan Lehnardt <ja...@apache.org>.
On 26 May 2009, at 13:21, Michael Stillwell wrote:

> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler  
> <B....@pobox.com> wrote:
>> Has anyone found a clear definition of the limitations of reduce  
>> functions,
>> in terms of how large the reduced data may be?
>
> In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
> Damian says:
>
> "the size of the reduce value can be logarithmic with respect to the  
> rows"

The wiki says the same.

Cheers
Jan
--