Posted to user@couchdb.apache.org by Łukasz Mielicki <mi...@gmail.com> on 2011/02/23 19:12:54 UTC

reduce order vs descending

Hi everybody,

I can observe that the order of values provided to the reduce function
depends on the descending query parameter.
Is this a bug? Or does Couch support left and right reduce separately?

Couchbook says: If you specify descending=true, the reading direction
is reversed, not the sort order of the rows in the view.

Also, my assumption is that reduced values are always provided in key
order (and in [key, _id] order for the first reduce step). Does that
apply to all reduce steps?

Thanks,
Łukasz

Re: reduce order vs descending

Posted by Łukasz Mielicki <mi...@gmail.com>.
2011/2/23 Paul Davis <pa...@gmail.com>:
>>>> For: reduce (null, [key_range_A1_A2, key_range_B1_B2, key_range_C1_C2], true)
>>>> Does the relation A1 < A2 < B1 < B2 < C1 < C2 hold? Does it depend on
>>>> descending?
>>>> For: reduce ([k1,k2,k3], [v1,v2,v3], false)
>>>> Can I depend on k1 < k2 < k3 if descending=true?
[...]
>> However, from a bird's-eye view, based on that one I cannot see a
>> reason not to preserve the first two (in terms of computational
>> complexity, even with reduce parallelized). It seems to be only a
>> matter of preserving the original b-tree partitioning in parameter
>> passing. Providing such guarantees would be beneficial in two ways:
>> - building a view would not depend on query parameters (sane)
>> - left/right-reduce semantics available (without sorting)

> I don't think any of those things should be guaranteed, because that
> means we won't be able to change them at some point in the future if
> the need arises, and I can't predict what changes we might need to
> make for various situations.

I have a different opinion on that, mostly because the lack of these
guarantees adds overhead in reduce that could be mitigated at
negligible cost (not counting the man-month cost, of course : ). I'm
not sure left/right reduce is an unusual scenario for a DB. I also
think these guarantees would make view properties consistent and
easier to grasp.

> I'm also suddenly wondering how this would behave when run on a
> cluster with the reductions.

That would probably depend on the data, but in my case I could allow
somewhat relaxed consistency, so time-based IDs (like
<timestamp|node|seq>) should work across a cluster, though I'm
sticking with a single node for now. I'm not afraid of data growth,
as the number of merged fields is limited in my case and I also
explicitly trim (useless) reduce output for excessive key ranges.
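For illustration, a minimal sketch (hypothetical naming, not from the
thread) of a <timestamp|node|seq> style ID whose string ordering
follows generation order:

```javascript
// Hypothetical time-based ID generator in the <timestamp|node|seq> spirit:
// zero-padded fields keep lexicographic order aligned with generation order,
// which is what makes comparing _ids as strings meaningful across nodes.
var seq = 0;

function pad(n, width) {
  var s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

function makeId(node, nowMillis) {
  seq += 1; // per-process sequence breaks ties within one millisecond
  return pad(nowMillis, 13) + "|" + pad(node, 3) + "|" + pad(seq, 6);
}
```

Zero-padding matters: without it, "10|..." would sort before "9|..."
and the ordering argument falls apart.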

Thanks for your time!

Łukasz

Re: reduce order vs descending

Posted by Paul Davis <pa...@gmail.com>.
2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
> On Wed, Feb 23, 2011 at 22:03, Paul Davis <pa...@gmail.com> wrote:
>
>>> For: reduce (null, [key_range_A1_A2, key_range_B1_B2, key_range_C1_C2], true)
>>> Does the relation A1 < A2 < B1 < B2 < C1 < C2 hold? Does it depend on
>>> descending?
>>> For: reduce ([k1,k2,k3], [v1,v2,v3], false)
>>> Can I depend on k1 < k2 < k3 if descending=true?
>
>> These are specifically the assertions I'm saying you can't rely on. In
>> your case you will absolutely need to sort the input data.
>
>> The only assumption I think it's possible to make is that re-reductions
>> will always process reductions from contiguous parts of the b+tree.
>> I'm only at the *think* stage of certainty. I haven't gone through the
>> implementation to check if that's an absolute truth, and I can't yet
>> guarantee that it won't ever change if it is true.
>>
>> Just in case that doesn't make sense, imagine you have a set of
>> reductions that are from ordered regions of the view b+tree labeled as
>> such: A, B, C, D. That is to say, data in the reduction A is based
>> entirely on keys <= reduction keys in B <= reduction keys in C <=
>> reduction keys in D. Given that, the only assumption I think is
>> possible is that you are guaranteed that if A and C are passed
>> simultaneously to a re-reduce, then B is guaranteed to be passed as
>> well. In other words, you won't get a reduction graph that looks like
>> ((A | C) | (B | D)), which would break this sort of approach.
>
> Thanks for clarifying that.
> The latter assumption was so natural to me I didn't even consider it.
> I don't think there would be much use for such reduce results at
> query time.
>
> However, from a bird's-eye view, based on that one I cannot see a
> reason not to preserve the first two (in terms of computational
> complexity, even with reduce parallelized). It seems to be only a
> matter of preserving the original b-tree partitioning in parameter
> passing. Providing such guarantees would be beneficial in two ways:
> - building a view would not depend on query parameters (sane)
> - left/right-reduce semantics available (without sorting)
>
> What you think?

I don't think any of those things should be guaranteed, because that
means we won't be able to change them at some point in the future if
the need arises, and I can't predict what changes we might need to
make for various situations.

I'm also suddenly wondering how this would behave when run on a
cluster with the reductions.

Re: reduce order vs descending

Posted by Łukasz Mielicki <mi...@gmail.com>.
On Wed, Feb 23, 2011 at 22:03, Paul Davis <pa...@gmail.com> wrote:

>> For: reduce (null, [key_range_A1_A2, key_range_B1_B2, key_range_C1_C2], true)
>> Does the relation A1 < A2 < B1 < B2 < C1 < C2 hold? Does it depend on
>> descending?
>> For: reduce ([k1,k2,k3], [v1,v2,v3], false)
>> Can I depend on k1 < k2 < k3 if descending=true?

> These are specifically the assertions I'm saying you can't rely on. In
> your case you will absolutely need to sort the input data.

> The only assumption I think it's possible to make is that re-reductions
> will always process reductions from contiguous parts of the b+tree.
> I'm only at the *think* stage of certainty. I haven't gone through the
> implementation to check if that's an absolute truth, and I can't yet
> guarantee that it won't ever change if it is true.
>
> Just in case that doesn't make sense, imagine you have a set of
> reductions that are from ordered regions of the view b+tree labeled as
> such: A, B, C, D. That is to say, data in the reduction A is based
> entirely on keys <= reduction keys in B <= reduction keys in C <=
> reduction keys in D. Given that, the only assumption I think is
> possible is that you are guaranteed that if A and C are passed
> simultaneously to a re-reduce, then B is guaranteed to be passed as
> well. In other words, you won't get a reduction graph that looks like
> ((A | C) | (B | D)), which would break this sort of approach.

Thanks for clarifying that.
The latter assumption was so natural to me I didn't even consider it.
I don't think there would be much use for such reduce results at
query time.

However, from a bird's-eye view, based on that one I cannot see a
reason not to preserve the first two (in terms of computational
complexity, even with reduce parallelized). It seems to be only a
matter of preserving the original b-tree partitioning in parameter
passing. Providing such guarantees would be beneficial in two ways:
- building a view would not depend on query parameters (sane)
- left/right-reduce semantics available (without sorting)

What you think?

Regards,
Łukasz

Re: reduce order vs descending

Posted by Paul Davis <pa...@gmail.com>.
> Right, but it seems to work without sorting in reduce if descending is
> true, although I haven't tested it on a larger data set.
> I would like to know if I can omit sorting entirely in this case with
> the same result, that is:
>
> For: reduce (null, [key_range_A1_A2, key_range_B1_B2, key_range_C1_C2], true)
> Does the relation A1 < A2 < B1 < B2 < C1 < C2 hold? Does it depend on
> descending?
>
> For: reduce ([k1,k2,k3], [v1,v2,v3], false)
> Can I depend on k1 < k2 < k3 if descending=true?

These are specifically the assertions I'm saying you can't rely on. In
your case you will absolutely need to sort the input data.

The only assumption I think it's possible to make is that re-reductions
will always process reductions from contiguous parts of the b+tree.
I'm only at the *think* stage of certainty. I haven't gone through the
implementation to check if that's an absolute truth, and I can't yet
guarantee that it won't ever change if it is true.

Just in case that doesn't make sense, imagine you have a set of
reductions that are from ordered regions of the view b+tree labeled as
such: A, B, C, D. That is to say, data in the reduction A is based
entirely on keys <= reduction keys in B <= reduction keys in C <=
reduction keys in D. Given that, the only assumption I think is
possible is that you are guaranteed that if A and C are passed
simultaneously to a re-reduce, then B is guaranteed to be passed as
well. In other words, you won't get a reduction graph that looks like
((A | C) | (B | D)), which would break this sort of approach.
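The contiguity guarantee described above can be sketched as a
standalone check (plain JavaScript, not CouchDB internals): if every
reduction covers a contiguous span of the ordered regions, then a
re-reduce step may only ever combine adjacent spans.

```javascript
// Standalone sketch: model each reduction as a [start, end] span of region
// indices over the ordered regions A=0, B=1, C=2, D=3, and insist that a
// re-reduce step only combines adjacent spans.
function mergeSpans(a, b) {
  if (a[1] + 1 !== b[0]) {
    throw new Error("non-contiguous re-reduce: " + JSON.stringify([a, b]));
  }
  return [a[0], b[1]];
}

// ((A|B) | (C|D)) is fine: every intermediate span is contiguous.
var ab = mergeSpans([0, 0], [1, 1]);   // A|B covers regions 0..1
var cd = mergeSpans([2, 2], [3, 3]);   // C|D covers regions 2..3
var abcd = mergeSpans(ab, cd);         // covers 0..3

// ((A|C) | ...) would throw immediately, since A and C skip region B.
```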

Re: reduce order vs descending

Posted by Łukasz Mielicki <mi...@gmail.com>.
Thanks, my view is something like this:

map: function(doc) {
  emit(doc.group, doc);
},
reduce: function(keys, values, rereduce) {
  // Sort by _id so that fields from newer documents win on merge.
  values.sort(function(x, y) { return (x._id < y._id) ? -1 : 1; });
  var result = {};
  for (var i = 0; i < values.length; ++i) {
    var doc = values[i];
    for (var k in doc) result[k] = doc[k];
  }
  return result;
}

Documents are incremental updates, and _id is guaranteed to increase
with time. For all documents with the same key (say, group), documents
are combined so that fields of the document with the higher _id
override fields of documents with a lower _id.
I query only in the range of a single key/group. I plan to throw, or
generate an empty result, if reduce is applied to docs with adjacent
keys, to save on view size.

Is a sort step (and preserving _id) necessary here?
(descending=true&group=true on query.)
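One way to sidestep the sorting question entirely (a hypothetical
variant, not from the thread) is a merge that compares _ids field by
field, so the result cannot depend on the order in which values
arrive:

```javascript
// Hypothetical order-independent merge: rather than sorting first, a field is
// overwritten only when the incoming document's _id is higher than the _id
// the field was last taken from. Because _id itself is merged the same way,
// the result keeps the newest _id, and no input ordering assumption is needed.
function merge(values) {
  var result = {};
  var ids = {}; // field name -> _id of the document it was taken from
  values.forEach(function (doc) {
    for (var k in doc) {
      if (!(k in ids) || ids[k] < doc._id) {
        result[k] = doc[k];
        ids[k] = doc._id;
      }
    }
  });
  return result;
}
```

The trade-off is that the thread's original view keeps only one _id
per merged object, while this sketch tracks one per field while
merging; whether that overhead is acceptable depends on document width.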

Regards,
Łukasz


2011/2/23 Paul Davis <pa...@gmail.com>:
> 2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
>> 2011/2/23 Paul Davis <pa...@gmail.com>:
>>> 2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
>>>> Hi everybody,
>>>>
>>>> I can observe that the order of values provided to the reduce function
>>>> depends on the descending query parameter.
>>>> Is this a bug? Or does Couch support left and right reduce separately?
>>
>>> Never write code that depends on the ordering of keys in reductions.
>>> It will break. The order is undefined and I don't ever see it being
>>> defined due to lots and lots of edge cases in that code.
>>
>> I'm pretty confident about my use-case properties. It is similar to
>> finding max value(s), but in my case the key/id matters because they
>> contain information (I'm not using UUIDs). The remaining question is
>> whether I have to sort them in the reduce step given that descending
>> is always true.
>>
>> IMHO it's vital to be able to get both left-reduce and right-reduce
>> semantics over a key range.
>>
>> Thanks,
>> Łukasz
>>
>
> I'm saying that you can't rely on the ordering of keys passed to your
> reduce functions, which is not the same as relying on the order in
> which re-reductions are combined.
>
> If you post some example data and a description of what you'd like to
> accomplish I might be able to help more concretely on your specific
> case.
>

Re: reduce order vs descending

Posted by Paul Davis <pa...@gmail.com>.
2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
> 2011/2/23 Paul Davis <pa...@gmail.com>:
>> 2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
>>> Hi everybody,
>>>
>>> I can observe that the order of values provided to the reduce function
>>> depends on the descending query parameter.
>>> Is this a bug? Or does Couch support left and right reduce separately?
>
>> Never write code that depends on the ordering of keys in reductions.
>> It will break. The order is undefined and I don't ever see it being
>> defined due to lots and lots of edge cases in that code.
>
> I'm pretty confident about my use-case properties. It is similar to
> finding max value(s), but in my case the key/id matters because they
> contain information (I'm not using UUIDs). The remaining question is
> whether I have to sort them in the reduce step given that descending
> is always true.
>
> IMHO it's vital to be able to get both left-reduce and right-reduce
> semantics over a key range.
>
> Thanks,
> Łukasz
>

I'm saying that you can't rely on the ordering of keys passed to your
reduce functions, which is not the same as relying on the order in
which re-reductions are combined.

If you post some example data and a description of what you'd like to
accomplish I might be able to help more concretely on your specific
case.

Re: reduce order vs descending

Posted by Łukasz Mielicki <mi...@gmail.com>.
2011/2/23 Paul Davis <pa...@gmail.com>:
> 2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
>> Hi everybody,
>>
>> I can observe that the order of values provided to the reduce function
>> depends on the descending query parameter.
>> Is this a bug? Or does Couch support left and right reduce separately?

> Never write code that depends on the ordering of keys in reductions.
> It will break. The order is undefined and I don't ever see it being
> defined due to lots and lots of edge cases in that code.

I'm pretty confident about my use-case properties. It is similar to
finding max value(s), but in my case the key/id matters because they
contain information (I'm not using UUIDs). The remaining question is
whether I have to sort them in the reduce step given that descending
is always true.

IMHO it's vital to be able to get both left-reduce and right-reduce
semantics over a key range.

Thanks,
Łukasz

Re: reduce order vs descending

Posted by Paul Davis <pa...@gmail.com>.
2011/2/23 Łukasz Mielicki <mi...@gmail.com>:
> Hi everybody,
>
> I can observe that the order of values provided to the reduce function
> depends on the descending query parameter.
> Is this a bug? Or does Couch support left and right reduce separately?
>
> Couchbook says: If you specify descending=true, the reading direction
> is reversed, not the sort order of the rows in the view.
>
> Also, my assumption is that reduced values are always provided in key
> order (and in [key, _id] order for the first reduce step). Does that
> apply to all reduce steps?
>
> Thanks,
> Łukasz
>

Never write code that depends on the ordering of keys in reductions.
It will break. The order is undefined and I don't ever see it being
defined due to lots and lots of edge cases in that code.
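Given that the ordering is undefined, the safe pattern is a reduce
that is commutative and associative, so input order cannot matter. A
generic sketch (standard CouchDB reduce argument shape, plain sum):

```javascript
// Generic order-insensitive reduce: summation is commutative and associative,
// so the undefined ordering of keys/values cannot change the result.
function reduceSum(keys, values, rereduce) {
  // At the first level `values` are emitted numbers; on re-reduce they are
  // partial sums from b+tree nodes. Either way, addition ignores ordering.
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}
```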