Posted to user@couchdb.apache.org by hhsuper <hh...@gmail.com> on 2009/06/22 15:49:15 UTC

Re: 'Grouping' documents so that a set of documents is passed to the view function

Brian's description of the reduce function is clear, but I think you can
achieve your goal as follows:

function(doc){
  emit(doc.group_key, doc)
}

function(keys,values,rereduce){
  //...
}

With the group=true option you can process documents by group_key in the
reduce function. In your example, reduce will be invoked twice: once with
values [firstdoc, seconddoc] for group key 1, and once with
[thirddoc, fourthdoc] for group key 2. Is that enough?
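A minimal sketch of the two functions filled in (assuming each document
carries a numeric doc.score field to aggregate; the field name is only an
illustration):

```javascript
// Map: key every document by its group_key; emit the value to aggregate.
function map(doc) {
  emit(doc.group_key, doc.score);
}

// Reduce: sum the emitted scores. With group=true CouchDB calls this once
// per distinct key; on rereduce the values are previously returned sums,
// so the same summation works for both phases.
function reduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// What CouchDB would effectively compute for group_key=1 with two docs:
var groupTotal = reduce([[1, "id1"], [1, "id2"]], [3, 4], false);
```

Query the view with ?group=true to get one reduced row per group_key.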

On Mon, Jun 22, 2009 at 5:07 PM, Brian Candler <B....@pobox.com> wrote:

> On Fri, Jun 19, 2009 at 09:43:31AM +0200, Daniel Trümper wrote:
> > Hi,
> >
> > I am somewhat new to CouchDB but have been doing some stuff with it and
> > this is my first post to the list so pardon if I am wrong :)
> >
> >
> >> It would be really cool if there were some way to pass all the docs
> >> with a value of 1 for group_key to a single map function call, so I
> >> could do computation across those related documents and emit the
> >> results ...  I'm just using the magic group_key attribute as an
> >> example, if such a feature were to actually be made I'd think you'd
> >> define a javascript function which returned a single groupping k to
> >> exist I
> > I think this is what the reduce function is for.
>
> No, I'm afraid it's not.
>
> The OP wants to calculate information across a group of related documents.
> CouchDB does not guarantee that all the related documents will be passed to
> the reduce function at the same time. It may pass documents (d1,d2,d3) to
> the reduce function to generate Rx, then pass (d4,d5,d6) to the reduce
> function to generate Ry, then (d7,d8,d9) to generate Rz, then pass
> (Rx,Ry,Rz) to the re-reduce function to generate the final R value.
>
> If the values sharing the key were e.g. d3,d4 then you won't be able to
> process them together, as they would not be presented to the reduce
> function at the same time.
>
> Using a grouped reduce query is better (i.e. group=true), but a large set
> of documents sharing the same group key are still likely to be split into
> several reductions with a re-reduce. The OP was talking about ~100
> documents sharing this key, and so they may well be split this way.
>
> Furthermore, CouchDB optimises its reductions by storing the reduced value
> for all the documents within the same Btree node. For example, suppose you
> have
>
>   +-------------+  +-------------+  +-------------+
>   | d1 d2 d3 Rx |  | d4 d5 d6 Ry |  | d7 d8 d9 Rz |
>   +-------------+  +-------------+  +-------------+
>
> Then you make a reduce query for the key range which includes documents d2
> to d8 inclusive (or a grouped query where d2 to d8 share the same group
> key). CouchDB will calculate:
>
>  R1 = Reduce(d2,d3)
>  R2 = Reduce(d7,d8)
>  R  = Rereduce(R1,Ry,R2)
>
> That is: the already-reduced value of Ry=Reduce(d4,d5,d6) is reused without
> recomputation. So the reduce function doesn't see documents d4 to d6 again.
>
> So in summary: you cannot rely on the reduce function to be able to process
> adjacent documents. You *must* do this sort of processing client-side.
>
> HTH,
>
> Brian.
>



-- 
Yours sincerely

Jack Su

Re: Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by th...@gmail.com.
Brian,

Could a better approach be to use an ETL tool such as Jasper or Pentaho to
perform the analysis, and either use those tools for display or load the
results into another CouchDB or MySQL instance for viewing? I would opt to
perform the analysis with a Business Intelligence suite, unless the
reporting needs are fairly basic, since he mentioned growth would be a
factor. Growth could be accounted for within a Data Warehouse, separate
from CouchDB, or kept inside it. Something to consider. But my background
is in Enterprise architectures & scaling.

-Thad

On Jun 26, 2009 4:34am, Brian Candler <B....@pobox.com> wrote:
> On Fri, Jun 26, 2009 at 05:13:18PM +0800, hhsuper wrote:
> > By the way, we chose CouchDB for part of our current application
> > because that part of the data is huge and grows every day. We need to
> > do some analysis on this data and give the results back to the user
> > (through our website). Maybe I need to consider another solution: still
> > do the analysis with a CouchDB view, but instead of querying it
> > directly from the web, have a cron job sync the analysis (CouchDB view)
> > data to a MySQL table and feed the webpage from that table, which makes
> > paging and sorting easy at the MySQL level. This is only a rough
> > thought.
>
> Yep, we seem to have had the same idea. Or maybe it makes sense to
> perform the analysis or partial analysis when exporting, and stick the
> summarised data into another CouchDB database which is searched via the
> web.


Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jun 26, 2009 at 05:13:18PM +0800, hhsuper wrote:
>    By the way, we chose CouchDB for part of our current application
>    because that part of the data is huge and grows every day. We need to
>    do some analysis on this data and give the results back to the user
>    (through our website). Maybe I need to consider another solution:
>    still do the analysis with a CouchDB view, but instead of querying it
>    directly from the web, have a cron job sync the analysis (CouchDB
>    view) data to a MySQL table and feed the webpage from that table,
>    which makes paging and sorting easy at the MySQL level. This is only
>    a rough thought.

Yep, we seem to have had the same idea. Or maybe it makes sense to
perform the analysis or partial analysis when exporting, and stick the
summarised data into another CouchDB database which is searched via the web.

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
By the way, we chose CouchDB for part of our current application because
that part of the data is huge and grows every day. We need to do some
analysis on this data and give the results back to the user (through our
website). Maybe I need to consider another solution: still do the analysis
with a CouchDB view, but instead of querying it directly from the web, have
a cron job sync the analysis (CouchDB view) data to a MySQL table and feed
the webpage from that table, which makes paging and sorting easy at the
MySQL level. This is only a rough thought.

On Fri, Jun 26, 2009 at 5:01 PM, hhsuper <hh...@gmail.com> wrote:

> Thanks Brian for your patience and great help. Maybe CouchDB isn't
> suitable for my requirement, because it's so difficult to implement my
> common requirements (paging/sorting at the couchdb level). For example:
>
> I need output from a view (or multiple views) like:
> key: [col1, col2, col3], values: {"v1": 23, "v2": 7}
>
> with the values calculated in the reduce function. Can you give an
> example of how to implement sorting by any of the col1/col2/col3/v1/v2
> column values with couchdb (multiple views also accepted)? Because the
> sorting is over all the data, with a million records it's impossible to
> load them into the client and sort there, so couchdb-level sorting is
> needed.
>
> Paging is the same: when I implement sorting and paging together at the
> couchdb level, I need to handle startkey for the descending param, which
> is especially difficult (I implement paging with couchdb's startkey and
> limit params now).
>
>
> On Fri, Jun 26, 2009 at 3:15 PM, Brian Candler <B....@pobox.com>wrote:
>
>> On Fri, Jun 26, 2009 at 01:32:44PM +0800, hhsuper wrote:
>> >    Also, sorting needs to be supported on any column of the returned
>> >    keys and returned values, but it's difficult to implement that
>> >    with couchdb, isn't it?
>>
>> You need multiple views with couchdb, just as with an RDBMS you need
>> multiple indexes. (Well, with the RDBMS you don't *need* multiple indexes,
>> but if you don't have them, the query will end up doing whole table scans,
>> which may result in excessive amounts of slow disk I/O. In other words,
>> you're hiding the problem and storing it up to bite you later)
>>
>
>
>
> --
> Yours sincerely
>
> Jack Su
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jun 26, 2009 at 05:01:37PM +0800, hhsuper wrote:
>    with the values calculated in the reduce function, can you give an
>    example of how to implement sorting by any of the col1/col2/col3/v1/v2
>    column values with couchdb (multiple views also accepted)? Because
>    the sorting is over all the data

Sorting is done by map views:

  emit(some_value, ...);

When you access the view, it is sorted by key. If you want to sort by
different columns, then either have multiple views, or a single tagged view:

  emit(["by_foo", doc.foo||null], null);
  emit(["by_bar", doc.bar||null], null);
  emit(["by_baz", doc.baz||null], null);

If you want to select a subset of data (based on attribute foo), and then sort
by attribute bar, use compound keys:

  emit([doc.foo||null, doc.bar||null], null);

Then access using startkey=["foo1"]&endkey=["foo1",{}]. This is basically
the same as composite indexes in an RDBMS, although more flexible because
you can use derived data for the indexes, rather than just the raw data from
your 'columns'.
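A sketch of the compound-key pattern (the view name by_foo_bar and the
doc.foo/doc.bar fields follow the example above and are illustrative):

```javascript
// Map: compound key [foo, bar], so rows sharing foo sort together and are
// ordered by bar within that group.
function map(doc) {
  emit([doc.foo || null, doc.bar || null], null);
}

// Build the query string for "all rows where foo == fooValue, sorted by
// bar". The {} endkey sentinel collates after every other JSON value.
function rangeQuery(fooValue) {
  var startkey = JSON.stringify([fooValue]);
  var endkey = JSON.stringify([fooValue, {}]);
  return "_view/by_foo_bar?startkey=" + encodeURIComponent(startkey) +
         "&endkey=" + encodeURIComponent(endkey);
}

var url = rangeQuery("foo1");
```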

If there is filtering that can't be done in a map function, then you do it
client side. Ditto for sorting after the filtering.

Reduce views by default return only one value, so sorting doesn't mean
anything. But if you do a grouped reduce view, then the values are sorted by
the group key.

>    With a million records it's impossible to load them into the client
>    and sort there, so couchdb-level sorting is needed.

Is it not just as unacceptable to ask your RDBMS to fetch one million
records and sort them before returning a handful of them to the client? I
think you are just hiding your problem. Maybe your RDBMS is clever enough to
keep the one million sorted records cached ready for the next query for the
next handful of records... but maybe it is not.

This, in my opinion, is one of the best things about CouchDB. It forces you
to be explicit about your indexing and searching algorithms, and exposes the
weak points for you.

Anyway, unless you are experiencing a problem right now, why worry about it?
Remember "You Ain't Gonna Need It" and "Do The Simplest Thing Which Can
Possibly Work". Build your app and see how it performs. If you reach a
scaling problem which you absolutely cannot deal with under CouchDB, it's
easy to export the data and move it into something else. Or you can take
periodic snapshots of your data and load them into some other app, whilst
keeping CouchDB for loading, editing and searching.

>    Paging is the same: when I implement sorting and paging together at
>    the couchdb level, I need to handle startkey for the descending
>    param, which is especially difficult (I implement paging with
>    couchdb's startkey and limit params now).

That's basically the right way to do paging. It's easy to do "jump to next
page" and "jump to previous page". If you really want to do "skip to page N"
then you can use skip and limit instead, which is less efficient as you get
further away from the start. This is exactly the same inefficiency you would
get using offset and limit in a SQL query, of course. In other words:
CouchDB exposes a highly efficient way to do paging which SQL doesn't offer.
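A sketch of that startkey/limit pattern: request one extra row per page and
use its key as the next page's startkey (the {key: ...} row shape mirrors a
view response; the helper name is illustrative):

```javascript
// Given the rows from a request made with limit = pageSize + 1, split off
// the page to display and the startkey to use for the next page (null if
// this was the last page).
function paginate(rows, pageSize) {
  var page = rows.slice(0, pageSize);
  var nextStartkey = rows.length > pageSize ? rows[pageSize].key : null;
  return { rows: page, nextStartkey: nextStartkey };
}

// Example: page size 2, three rows returned, so a next page exists.
var result = paginate([{ key: "a" }, { key: "b" }, { key: "c" }], 2);
```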

Anyway, I think I've reached my limit on this conversation. If you are
convinced that CouchDB is not what you want, then please go ahead and use
something else, rather than trying to convince us why CouchDB is wrong for
your application :-)

Regards,

Brian.

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
Thanks Brian for your patience and great help. Maybe CouchDB isn't
suitable for my requirement, because it's so difficult to implement my
common requirements (paging/sorting at the couchdb level). For example:

I need output from a view (or multiple views) like:
key: [col1, col2, col3], values: {"v1": 23, "v2": 7}

with the values calculated in the reduce function. Can you give an example
of how to implement sorting by any of the col1/col2/col3/v1/v2 column
values with couchdb (multiple views also accepted)? Because the sorting is
over all the data, with a million records it's impossible to load them
into the client and sort there, so couchdb-level sorting is needed.

Paging is the same: when I implement sorting and paging together at the
couchdb level, I need to handle startkey for the descending param, which
is especially difficult (I implement paging with couchdb's startkey and
limit params now).

On Fri, Jun 26, 2009 at 3:15 PM, Brian Candler <B....@pobox.com> wrote:

> On Fri, Jun 26, 2009 at 01:32:44PM +0800, hhsuper wrote:
> >    Also, sorting needs to be supported on any column of the returned
> >    keys and returned values, but it's difficult to implement that
> >    with couchdb, isn't it?
>
> You need multiple views with couchdb, just as with an RDBMS you need
> multiple indexes. (Well, with the RDBMS you don't *need* multiple indexes,
> but if you don't have them, the query will end up doing whole table scans,
> which may result in excessive amounts of slow disk I/O. In other words,
> you're hiding the problem and storing it up to bite you later)
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jun 26, 2009 at 01:32:44PM +0800, hhsuper wrote:
>    Also, sorting needs to be supported on any column of the returned
>    keys and returned values, but it's difficult to implement that with
>    couchdb, isn't it?

You need multiple views with couchdb, just as with an RDBMS you need
multiple indexes. (Well, with the RDBMS you don't *need* multiple indexes,
but if you don't have them, the query will end up doing whole table scans,
which may result in excessive amounts of slow disk I/O. In other words,
you're hiding the problem and storing it up to bite you later)

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
Great job, Brian. I see you have updated the wiki at
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views with content
from this discussion; it will be a great help to many people.

On Fri, Jun 26, 2009 at 1:32 PM, hhsuper <hh...@gmail.com> wrote:

> By the way, Brian, I see many people say that with couchdb you can
> download the data from a view and then do sorting/filtering/pagination
> in the client. Yes, you can do that, but I think we absolutely need
> these features at the couchdb level, just as an rdbms supports them. For
> example, when I have a view with a million records I absolutely need
> paging/sorting at the couchdb level, and the sorting needs to be
> supported on any column of the returned keys and returned values, but
> it's difficult to implement that with couchdb, isn't it?
>
>
> On Fri, Jun 26, 2009 at 10:39 AM, hhsuper <hh...@gmail.com> wrote:
>
>> Thanks Brian again. I totally understand your description of the
>> realistic map/reduce scenario; this is just what worries me about my
>> view. My reduce function is non-linear: I actually need the logic in
>> the reduce part, and that re-reduce code is obviously wrong. When a
>> realistic reduce occurs, as you say, "it's quite possible given N
>> documents that couchdb will reduce the first N-1, then reduce the last
>> 1", and then my logic isn't run.
>>
>> It seems difficult to execute this logic in reduce, unless I return a
>> large object (which I don't like) to make sure I can execute the same
>> logic in the re-reduce part. I could return an additional value holding
>> {'dialogid':bestScore,....} from the reduce function, which would make
>> sure I can execute the same logic in the re-reduce part, but as users
>> study more and more dialogs, the reduce value gets larger and larger.
>>
>> Should I conclude that logic like this isn't appropriate to implement
>> in couchdb's view?
>> Also, as you say, I can download all the data to the client to
>> calculate, but this is very costly and has scalability problems.
>>
>> > I don't really understand why you need a subquery in rdbms. I would
>> > just select all results where uid=x, and process them as required
>> > (for example: build a hash of dialogid=>bestScore and update it from
>> > each received row)
>>
>> Oh, maybe I didn't describe it clearly. With a subquery I can use only
>> one sql statement to get all the user's results (I'm implementing a
>> scoreboard) without any other program code, and within the query I can
>> implement pagination (physical paging) and sorting.
>>
>> On Thu, Jun 25, 2009 at 5:08 PM, Brian Candler <B....@pobox.com>wrote:
>>
>>> On Thu, Jun 25, 2009 at 09:34:31AM +0100, Brian Candler wrote:
>>> > Perhaps it will help you to understand this if you consider the
>>> > limiting case where exactly one document is fed into the 'reduce'
>>> > function at a time, and then the outputs of the reduce functions
>>> > are combined with a large re-reduce phase.
>>>
>>> Incidentally, this is a partly realistic scenario. It's quite possible
>>> given N documents that couchdb will reduce the first N-1, then reduce
>>> the last 1, then re-reduce those two values. This might be because of
>>> how the documents are split between Btree nodes, or there may be a
>>> limit on the number of documents passed to the reduce function in one
>>> go. This is entirely an implementation issue which you have no control
>>> over, so you must write your reduce/rereduce to give the same answer
>>> for *any* partitioning of documents.
>>>
>>> More info at
>>> http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>>>
>>> "To make incremental Map/Reduce possible, the Reduce function has the
>>> requirement that not only must it be referentially transparent, but it
>>> must also be commutative and associative for the array value input, to
>>> be able reduce on its own output and get the same answer, like this:
>>>
>>> f(Key, Values) == f(Key, [ f(Key, Values) ] )"
>>>
>>> Now, at first glance your re-reduce function appears to satisfy that
>>> condition, so perhaps there should be another one: namely, that for any
>>> partitioning of Values into subsets Values1, Values2, ... then
>>>
>>>  f(Key, Values) == f(Key, [ f(Key,Values1), f(Key,Values2), ... ] )
>>>
>>> But I am not a mathematician so I'm not sure if this condition is
>>> actually stronger.
>>>
>>> Regards,
>>>
>>> Brian.
>>>
>>
>>
>>
>> --
>> Yours sincerely
>>
>> Jack Su
>>
>
>
>
> --
> Yours sincerely
>
> Jack Su
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
By the way, Brian, I see many people say that with couchdb you can
download the data from a view and then do sorting/filtering/pagination in
the client. Yes, you can do that, but I think we absolutely need these
features at the couchdb level, just as an rdbms supports them. For
example, when I have a view with a million records I absolutely need
paging/sorting at the couchdb level, and the sorting needs to be supported
on any column of the returned keys and returned values, but it's difficult
to implement that with couchdb, isn't it?

On Fri, Jun 26, 2009 at 10:39 AM, hhsuper <hh...@gmail.com> wrote:

> Thanks Brian again. I totally understand your description of the
> realistic map/reduce scenario; this is just what worries me about my
> view. My reduce function is non-linear: I actually need the logic in the
> reduce part, and that re-reduce code is obviously wrong. When a
> realistic reduce occurs, as you say, "it's quite possible given N
> documents that couchdb will reduce the first N-1, then reduce the last
> 1", and then my logic isn't run.
>
> It seems difficult to execute this logic in reduce, unless I return a
> large object (which I don't like) to make sure I can execute the same
> logic in the re-reduce part. I could return an additional value holding
> {'dialogid':bestScore,....} from the reduce function, which would make
> sure I can execute the same logic in the re-reduce part, but as users
> study more and more dialogs, the reduce value gets larger and larger.
>
> Should I conclude that logic like this isn't appropriate to implement in
> couchdb's view?
> Also, as you say, I can download all the data to the client to
> calculate, but this is very costly and has scalability problems.
>
> > I don't really understand why you need a subquery in rdbms. I would
> > just select all results where uid=x, and process them as required (for
> > example: build a hash of dialogid=>bestScore and update it from each
> > received row)
>
> Oh, maybe I didn't describe it clearly. With a subquery I can use only
> one sql statement to get all the user's results (I'm implementing a
> scoreboard) without any other program code, and within the query I can
> implement pagination (physical paging) and sorting.
>
> On Thu, Jun 25, 2009 at 5:08 PM, Brian Candler <B....@pobox.com>wrote:
>
>> On Thu, Jun 25, 2009 at 09:34:31AM +0100, Brian Candler wrote:
>> > Perhaps it will help you to understand this if you consider the
>> > limiting case where exactly one document is fed into the 'reduce'
>> > function at a time, and then the outputs of the reduce functions are
>> > combined with a large re-reduce phase.
>>
>> Incidentally, this is a partly realistic scenario. It's quite possible
>> given N documents that couchdb will reduce the first N-1, then reduce
>> the last 1, then re-reduce those two values. This might be because of
>> how the documents are split between Btree nodes, or there may be a
>> limit on the number of documents passed to the reduce function in one
>> go. This is entirely an implementation issue which you have no control
>> over, so you must write your reduce/rereduce to give the same answer
>> for *any* partitioning of documents.
>>
>> More info at http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>>
>> "To make incremental Map/Reduce possible, the Reduce function has the
>> requirement that not only must it be referentially transparent, but it
>> must also be commutative and associative for the array value input, to
>> be able reduce on its own output and get the same answer, like this:
>>
>> f(Key, Values) == f(Key, [ f(Key, Values) ] )"
>>
>> Now, at first glance your re-reduce function appears to satisfy that
>> condition, so perhaps there should be another one: namely, that for any
>> partitioning of Values into subsets Values1, Values2, ... then
>>
>>  f(Key, Values) == f(Key, [ f(Key,Values1), f(Key,Values2), ... ] )
>>
>> But I am not a mathematician so I'm not sure if this condition is actually
>> stronger.
>>
>> Regards,
>>
>> Brian.
>>
>
>
>
> --
> Yours sincerely
>
> Jack Su
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Fri, Jun 26, 2009 at 10:39:06AM +0800, hhsuper wrote:
>    It seems difficult to execute this logic in reduce, unless I return
>    a large object (which I don't like) to make sure I can execute the
>    same logic in the re-reduce part. I could return an additional value
>    holding {'dialogid':bestScore,....} from the reduce function, which
>    would make sure I can execute the same logic in the re-reduce part,
>    but as users study more and more dialogs, the reduce value gets
>    larger and larger.
>    Should I conclude that logic like this isn't appropriate to
>    implement in couchdb's view?

That's what I've been trying to say :-)

>    Also, as you say, I can download all the data to the client to
>    calculate, but this is very costly and has scalability problems.

I don't think this is necessarily the case. Downloading 1,000 small
documents as a single JSON object in a single HTTP request is cheap. And as
for calculation, this is *better* done at the client, because it's easier to
scale with multiple clients on a single database, than multiple databases.
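As a sketch of that client-side calculation for the scoreboard discussed in
this thread (the row shape with key [uid, dialogid, sessionid] and the
wordCount/weightedScore fields are assumptions drawn from the thread, not a
fixed format):

```javascript
// Aggregate a downloaded view result locally: keep the best grade seen
// for each dialog, across all of that user's sessions.
function bestScorePerDialog(viewResult) {
  var best = {};
  viewResult.rows.forEach(function (row) {
    var dialogId = row.key[1];
    var grade = row.value.weightedScore / row.value.wordCount;
    if (!(dialogId in best) || grade > best[dialogId]) {
      best[dialogId] = grade;
    }
  });
  return best;
}

// One HTTP request downloads the rows; everything else happens locally.
var best = bestScorePerDialog({
  rows: [
    { key: ["u1", "d1", "s1"], value: { wordCount: 10, weightedScore: 80 } },
    { key: ["u1", "d1", "s2"], value: { wordCount: 10, weightedScore: 90 } },
    { key: ["u1", "d2", "s3"], value: { wordCount: 5,  weightedScore: 20 } }
  ]
});
```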

But if you don't think CouchDB is a good fit for your application - that is,
if you can find something more suitable - then I would advise you to use
that instead.

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
Thanks Brian again. I totally understand your description of the realistic
map/reduce scenario; this is just what worries me about my view. My reduce
function is non-linear: I actually need the logic in the reduce part, and
that re-reduce code is obviously wrong. When a realistic reduce occurs, as
you say, "it's quite possible given N documents that couchdb will reduce
the first N-1, then reduce the last 1", and then my logic isn't run.

It seems difficult to execute this logic in reduce, unless I return a
large object (which I don't like) to make sure I can execute the same
logic in the re-reduce part. I could return an additional value holding
{'dialogid':bestScore,....} from the reduce function, which would make
sure I can execute the same logic in the re-reduce part, but as users
study more and more dialogs, the reduce value gets larger and larger.

Should I conclude that logic like this isn't appropriate to implement in
couchdb's view?
Also, as you say, I can download all the data to the client to calculate,
but this is very costly and has scalability problems.

> I don't really understand why you need a subquery in rdbms. I would just
> select all results where uid=x, and process them as required (for example:
> build a hash of dialogid=>bestScore and update it from each received row)

Oh, maybe I didn't describe it clearly. With a subquery I can use only one
sql statement to get all the user's results (I'm implementing a scoreboard)
without any other program code, and within the query I can implement
pagination (physical paging) and sorting.

On Thu, Jun 25, 2009 at 5:08 PM, Brian Candler <B....@pobox.com> wrote:

> On Thu, Jun 25, 2009 at 09:34:31AM +0100, Brian Candler wrote:
> > Perhaps it will help you to understand this if you consider the
> > limiting case where exactly one document is fed into the 'reduce'
> > function at a time, and then the outputs of the reduce functions are
> > combined with a large re-reduce phase.
>
> Incidentally, this is a partly realistic scenario. It's quite possible
> given N documents that couchdb will reduce the first N-1, then reduce
> the last 1, then re-reduce those two values. This might be because of
> how the documents are split between Btree nodes, or there may be a limit
> on the number of documents passed to the reduce function in one go. This
> is entirely an implementation issue which you have no control over, so
> you must write your reduce/rereduce to give the same answer for *any*
> partitioning of documents.
>
> More info at http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>
> "To make incremental Map/Reduce possible, the Reduce function has the
> requirement that not only must it be referentially transparent, but it must
> also be commutative and associative for the array value input, to be able
> reduce on its own output and get the same answer, like this:
>
> f(Key, Values) == f(Key, [ f(Key, Values) ] )"
>
> Now, at first glance your re-reduce function appears to satisfy that
> condition, so perhaps there should be another one: namely, that for any
> partitioning of Values into subsets Values1, Values2, ... then
>
>  f(Key, Values) == f(Key, [ f(Key,Values1), f(Key,Values2), ... ] )
>
> But I am not a mathematician so I'm not sure if this condition is actually
> stronger.
>
> Regards,
>
> Brian.
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jun 25, 2009 at 09:34:31AM +0100, Brian Candler wrote:
> Perhaps it will help you to understand this if you consider the limiting
> case where exactly one document is fed into the 'reduce' function at a time,
> and then the outputs of the reduce functions are combined with a large
> re-reduce phase.

Incidentally, this is a partly realistic scenario. It's quite possible given
N documents that couchdb will reduce the first N-1, then reduce the last 1,
then re-reduce those two values. This might be because of how the documents
are split between Btree nodes, or there may be a limit on the number of
documents passed to the reduce function in one go. This is entirely an
implementation issue which you have no control over, so you must write your
reduce/rereduce to give the same answer for *any* partitioning of documents.

More info at http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views

"To make incremental Map/Reduce possible, the Reduce function has the
requirement that not only must it be referentially transparent, but it must
also be commutative and associative for the array value input, to be able
reduce on its own output and get the same answer, like this:

f(Key, Values) == f(Key, [ f(Key, Values) ] )"

Now, at first glance your re-reduce function appears to satisfy that
condition, so perhaps there should be another one: namely, that for any
partitioning of Values into subsets Values1, Values2, ... then

  f(Key, Values) == f(Key, [ f(Key,Values1), f(Key,Values2), ... ] )

But I am not a mathematician so I'm not sure if this condition is actually
stronger.
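For an associative and commutative reduce such as a sum, the partition
condition above does hold; a quick numeric check:

```javascript
// A sum reduce in CouchDB's (keys, values, rereduce) shape.
function reduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

var values = [3, 1, 4, 1, 5, 9];

// f(Key, Values): one reduction over everything.
var direct = reduce(null, values, false);

// f(Key, [ f(Key,Values1), f(Key,Values2) ]): reduce each part of an
// arbitrary partition, then rereduce the partial results.
var partial = [
  reduce(null, values.slice(0, 2), false),
  reduce(null, values.slice(2), false)
];
var rereduced = reduce(null, partial, true);
```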

Regards,

Brian.

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jun 25, 2009 at 09:24:31AM +0800, hhsuper wrote:
>    Let me describe the application scenario carefully: when a user
>    studies a dialog, they start a session (sessionid). The study of
>    each line in the dialog generates a couchdb document (containing
>    uid/dialogid/sessionid and wordcount/weightedScore/grade for the
>    line). The user could re-study the same dialog some days later, so
>    they start a new session for the same dialog. We want to get each
>    user's average grade from their study results (with the dialog as
>    the unit, so we need a sum per session), but for the same dialog we
>    only want to use the highest-grade session, not all sessions.

This sounds exactly like the sort of logic which should be implemented in
the client. Download all the user's results, process them, display them.

>    This seems too difficult to implement with one view. In an rdbms we
>    would build a sql query on a subquery (or on a db view); is it
>    proper to implement this with couchdb's view?

I don't really understand why you need a subquery in rdbms. I would just
select all results where uid=x, and process them as required (for example:
build a hash of dialogid=>bestScore and update it from each received row)

>    Function code below (the code may already be too complex). By the
>    way, when you said "root node with *all* the uids", I don't think I
>    clearly understand the view's internal storage structure, and I
>    can't find it in the wiki:

Good info in this blog posting (prob. should be linked from Wiki):
http://horicky.blogspot.com/2008/10/couchdb-implementation.html

A reduce value can be any Javascript value (number, string, array or
object). Some people have tried to build summary objects of the form
   {
    uid1: [some data],
    uid2: [some more data],
    uid3: [even more data],
    ...
   }

The problem here is that if you have a million uids, the reduce value will
be an object with a million members. And the reduce value is *stored* in the
root btree node. In fact, every btree node stores the reduce value for the
documents in that node and its children. This means reduce will become
ridiculously slow.

However, in your reduce function, you are always reducing to a single object
with three members, which is fine. The reduce value in the root node will be
a reduce calculated across all users, which may or may not mean anything,
but doesn't do any harm either.

>    function(keys, values, rereduce) {
>      var wordCount = 0;
>      var weightedScore = 0;
>      if( !rereduce ) {
>        // This is the reduce phase: we are reducing over values
>        // emitted by the map function.
>        var sessions = {};
>        for(var k in keys){
>            // Calculate the total value for every session (a session
>            // spans multiple dialog lines, i.e. CouchDB documents)
>            var key = keys[k][0];
>            key = key ? key.join('_') : key;
>            if (!sessions[key]) {
>                sessions[key] = values[k];
>            } else {
>                sessions[key].wordCount += values[k].wordCount;
>                sessions[key].weightedScore += values[k].weightedScore;
>                sessions[key].grade =
>                    sessions[key].weightedScore / sessions[key].wordCount;
>            }
>        }
>        // Pick the top-graded session for each dialog
>        var dialogsessions = {};
>        for(var sk in sessions){
>            var dialogId = sk ? sk.split('_')[1] : sk;
>            if(!dialogsessions[dialogId]){
>                dialogsessions[dialogId] = sessions[sk];
>            } else if(dialogsessions[dialogId].grade < sessions[sk].grade){
>                dialogsessions[dialogId] = sessions[sk];
>            }
>        }
>        // Calculate the result
>        for(var ds in dialogsessions){
>            wordCount += dialogsessions[ds].wordCount;
>            weightedScore += dialogsessions[ds].weightedScore;
>        }
>      } else {
>        // This is the rereduce phase: we are re-reducing previously
>        // reduced values.
>        for(var i in values) {
>          wordCount += values[i].wordCount;
>          weightedScore += values[i].weightedScore;
>        }
>      }
>      return {"wordCount"     : wordCount,
>              "weightedScore" : weightedScore,
>              "grade"         : weightedScore / wordCount
>         };
>    }

This looks wrong to me. If the re-reduce function only has to sum wordCount
and weightedScore, why isn't such simple logic also in the reduce function?

Your reduce function is non-linear. In particular, it searches for the
maximum value of something. This seems incompatible with a linear re-reduce
function. It's certainly fine to have reduce and re-reduce functions which
calculate maxima, but I would expect "find maximum" logic in both reduce and
re-reduce.

With your logic, my suspicion is that you will get different answers
depending on exactly how the input documents are divided between the reduce
and re-reduce phases. Since you have no control over that, it means your
answers will be wrong sometimes.

Perhaps it will help you to understand this if you consider the limiting
case where exactly one document is fed into the 'reduce' function at a time,
and then the outputs of the reduce functions are combined with a large
re-reduce phase. (Also consider what happens if exactly one document at a
time is fed into the 'reduce' function, and then pairs of reduce values are
fed into 'rereduce' forming a binary tree)

I wouldn't be surprised if there's a constant you can tweak somewhere in the
CouchDB source code which would let you actually get this behaviour in
practice.
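To make this concrete, here is a minimal sketch (with illustrative field names, not the OP's actual schema) of the "find maximum logic in both reduce and re-reduce" idea: because both phases apply the same "take the max" rule, the answer is the same however CouchDB splits the input.

```javascript
// A reduce/rereduce pair that tracks a maximum grade. Both phases
// receive objects shaped like {grade: n}, so one piece of logic
// serves reduce and rereduce alike, making the result independent
// of how CouchDB chunks the rows.
function reduce(keys, values, rereduce) {
  var best = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].grade > best.grade) best = values[i];
  }
  return best;
}

// Simulate two reduce calls followed by a rereduce over their outputs:
var rows = [{grade: 61}, {grade: 95}, {grade: 72}, {grade: 88}];
var r1 = reduce(null, rows.slice(0, 2), false); // {grade: 95}
var r2 = reduce(null, rows.slice(2), false);    // {grade: 88}
var chunked = reduce(null, [r1, r2], true);     // {grade: 95}
var direct = reduce(null, rows, false);         // {grade: 95} -- same answer
```

The key property is that the output of reduce has the same shape as its inputs, so rereduce can reuse the identical logic.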

Also, someone wrote a Javascript implementation of map/reduce which would
let you play with this interactively. This is linked from the very bottom of
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views

HTH,

Brian.

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
Thanks Brian very much for the quick reply. I should say my description
wasn't very clear; there is some complex business logic that needs to be
implemented within reduce.

Let me describe the application scenario more carefully: when a user studies
a dialog, they start a session (sessionid). Studying each line of the dialog
generates a CouchDB document (holding uid/dialogid/sessionid plus
wordcount/weightedScore/grade for that line). The user could re-study the
same dialog some days later, starting a new session for the same dialog. We
want every user's average grade from their study results, with the dialog as
the unit (so we need a sum per session), but for a given dialog we only want
to use the session with the highest grade, not all sessions.

This seems too difficult to implement with one view. As implemented in an
RDBMS, we would build a SQL query on a subquery (or on a DB view); is it
proper to implement this with CouchDB's views?

You are right, Brian: wordcount/weightedScore can simply be summed, and
average grade = weightedScore / wordcount. I paste my previous reduce
function code below (the code may already be too complex). By the way, when
you said "root node with *all* the uids", I realized I'm not very clear
about the view's internal storage structure, and I can't find it in the wiki:

function(keys, values, rereduce) {
  var wordCount = 0;
  var weightedScore = 0;
  if( !rereduce ) {
    // This is the reduce phase: we are reducing over values emitted
    // by the map function.
    var sessions = {};
    for(var k in keys){
        // Calculate the total value for every session (a session spans
        // multiple dialog lines, i.e. multiple CouchDB documents)
        var key = keys[k][0];             // keys[k] is [emitted key, doc id]
        key = key ? key.join('_') : key;  // "uid_dialogid_sessionid"
        if (!sessions[key]) {
            sessions[key] = values[k];
        } else {
            sessions[key].wordCount += values[k].wordCount;
            sessions[key].weightedScore += values[k].weightedScore;
            sessions[key].grade =
                sessions[key].weightedScore / sessions[key].wordCount;
        }
    }
    // Pick the top-graded session for each dialog
    var dialogsessions = {};
    for(var sk in sessions){
        var dialogId = sk ? sk.split('_')[1] : sk;
        if(!dialogsessions[dialogId]){
            dialogsessions[dialogId] = sessions[sk];
        } else if(dialogsessions[dialogId].grade < sessions[sk].grade){
            dialogsessions[dialogId] = sessions[sk];
        }
    }
    // Calculate the result
    for(var ds in dialogsessions){
        wordCount += dialogsessions[ds].wordCount;
        weightedScore += dialogsessions[ds].weightedScore;
    }
  } else {
    // This is the rereduce phase: we are re-reducing previously
    // reduced values.
    for(var i in values) {
      wordCount += values[i].wordCount;
      weightedScore += values[i].weightedScore;
    }
  }

  return {"wordCount"     : wordCount,
          "weightedScore" : weightedScore,
          "grade"         : weightedScore / wordCount
     };
}

On Wed, Jun 24, 2009 at 11:43 PM, Brian Candler <B....@pobox.com> wrote:

> On Wed, Jun 24, 2009 at 06:35:56PM +0800, hhsuper wrote:
> >    The map function emits this structure (key columns refer to
> >    uid/dialogid/sessionid):
> >    emit(["86", "10380", "4172"], {wordCount: 20, weightedScore: 1380,
> >    grade: 69})
> >    The reduce function returns: {wordCount: 20, weightedScore: 1380,
> >    grade: 69}
> >    The reduce function's logic: first calculate the sum value for every
> >    unique uid_dialogid_sessionid key, then get the max value for every
> >    unique uid_dialogid key, and at last sum the values for the key uid.
> >    These calculations apply to wordCount/weightedScore/grade.
>
> Code would probably speak more clearly than words here. Since I don't
> understand your algorithm from that description, I can only talk in
> generalities.
>
> Assuming that you have some uid and some calculated values against that uid
> (and the same uid appears in multiple documents), then one option would be
> a
> reduce function which emits
>
>   {
>    uid1: {wordCount: 20, weightedScore: 1380, grade: 69},
>    uid2: {...etc}
>   }
>
> Then the rereduce function performs the same logic for all the uids seen in
> the input. However the output of such a reduce function will grow without
> bounds, and the root node will include the information for *all* the uids.
> This is not good.
>
> A better reduce function would output null if it has multiple uids in its
> input. If it sees only a single uid across all its inputs, it can output
>
>  {uid: 1234, wordCount: 20, weightedScore: 1380, grade: 69}
>
> Then the re-reduce function would do the same: if all its inputs have the
> same uid then it calculates the relevant values, otherwise outputs null.
> This obviously reduces to null, except when you do a query where the key
> range covers documents with only one uid (or you group by uid), in which
> case you'll get the info you're looking for.
>
> All this depends on the logic by which wordCount, weightedScore and grade
> from multiple documents may be combined, and whether the intermediate
> results can also be combined. I mean, I imagine the wordCounts can simply
> be summed, but can the other values be combined similarly?
>
> But in any case: reduce functions are not suitable for all purposes. If you
> can't get the answer you need from a reduce function, then you need to
> perform the calculation client-side. Sorry, that's how it is.
>
> Regards,
>
> Brian.
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Wed, Jun 24, 2009 at 06:35:56PM +0800, hhsuper wrote:
>    The map function emits this structure (key columns refer to
>    uid/dialogid/sessionid):
>    emit(["86", "10380", "4172"], {wordCount: 20, weightedScore: 1380,
>    grade: 69})
>    The reduce function returns: {wordCount: 20, weightedScore: 1380, grade: 69}
>    The reduce function's logic: first calculate the sum value for every
>    unique uid_dialogid_sessionid key, then get the max value for every
>    unique uid_dialogid key, and at last sum the values for the key uid.
>    These calculations apply to wordCount/weightedScore/grade.

Code would probably speak more clearly than words here. Since I don't understand
your algorithm from that description, I can only talk in generalities.

Assuming that you have some uid and some calculated values against that uid
(and the same uid appears in multiple documents), then one option would be a
reduce function which emits

   {
    uid1: {wordCount: 20, weightedScore: 1380, grade: 69},
    uid2: {...etc}
   }

Then the rereduce function performs the same logic for all the uids seen in
the input. However the output of such a reduce function will grow without
bounds, and the root node will include the information for *all* the uids.
This is not good.

A better reduce function would output null if it has multiple uids in its
input. If it sees only a single uid across all its inputs, it can output

  {uid: 1234, wordCount: 20, weightedScore: 1380, grade: 69}

Then the re-reduce function would do the same: if all its inputs have the
same uid then it calculates the relevant values, otherwise outputs null.
This obviously reduces to null, except when you do a query where the key
range covers documents with only one uid (or you group by uid), in which
case you'll get the info you're looking for.
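As a hedged sketch of that scheme (field names follow the example above; this is illustrative, not a drop-in implementation), note that the output has the same shape as the inputs, so one function can serve both reduce and rereduce:

```javascript
// Sum wordCount/weightedScore only while every input shares one uid;
// return null as soon as uids are mixed, or a null partial appears
// (which is what a rereduce over mixed-uid partials will see).
function reduce(keys, values, rereduce) {
  var uid = values[0] && values[0].uid;
  var wordCount = 0, weightedScore = 0;
  for (var i = 0; i < values.length; i++) {
    if (values[i] === null || values[i].uid !== uid) {
      return null; // mixed uids: no meaningful single-user total
    }
    wordCount += values[i].wordCount;
    weightedScore += values[i].weightedScore;
  }
  return {uid: uid, wordCount: wordCount, weightedScore: weightedScore,
          grade: weightedScore / wordCount};
}

var sameUser = [
  {uid: 1234, wordCount: 20, weightedScore: 1380},
  {uid: 1234, wordCount: 10, weightedScore: 600}
];
var mixed = sameUser.concat([{uid: 99, wordCount: 5, weightedScore: 300}]);
var single = reduce(null, sameUser, false); // {uid: 1234, wordCount: 30, ...}
var nothing = reduce(null, mixed, false);   // null
```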

All this depends on the logic by which wordCount, weightedScore and grade
from multiple documents may be combined, and whether the intermediate
results can also be combined. I mean, I imagine the wordCounts can simply
be summed, but can the other values be combined similarly?

But in any case: reduce functions are not suitable for all purposes. If you
can't get the answer you need from a reduce function, then you need to
perform the calculation client-side. Sorry, that's how it is.

Regards,

Brian.

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
Hi, Brian

Your description gave me some thoughts about my previous view, whose reduce
function depends on group queries to process documents and which will run
into some problems:

The map function emits this structure (key columns refer to
uid/dialogid/sessionid):
emit(["86", "10380", "4172"], {wordCount: 20, weightedScore: 1380, grade:
69})

The reduce function returns: {wordCount: 20, weightedScore: 1380, grade: 69}

The reduce function's logic: first calculate the sum value for every unique
uid_dialogid_sessionid key, then get the max value for every unique
uid_dialogid key, and at last sum the values for the key uid. These
calculations apply to wordCount/weightedScore/grade.

But with CouchDB's incremental view updates, when a new document is added
with the same uid/dialogid/sessionid values as a previously calculated
document, the new document's value will be processed by the reduce function
without exactly following the reduce logic, because there is no previous
value to compare against.

I would need to make the reduce function more complex and have it return a
large object, which I don't want. Any suggestions?

On Tue, Jun 23, 2009 at 3:59 PM, Brian Candler <B....@pobox.com> wrote:

> On Tue, Jun 23, 2009 at 09:15:46AM +0800, hhsuper wrote:
> >    Hi Brian, I see what you mean. Even so, we can still use a reduce
> >    function, which requires us to implement rereduce correctly to
> >    process these grouped documents and get the right result. Is that
> >    right, Brian?
>
> If sufficient information is passed through from reduce to re-reduce, yes.
> But it may mean your 'reduce' function has to output a large object and
> your
> re-reduce function is complex.
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Tue, Jun 23, 2009 at 09:15:46AM +0800, hhsuper wrote:
>    Hi Brian, I see what you mean. Even so, we can still use a reduce
>    function, which requires us to implement rereduce correctly to
>    process these grouped documents and get the right result. Is that
>    right, Brian?

If sufficient information is passed through from reduce to re-reduce, yes.
But it may mean your 'reduce' function has to output a large object and your
re-reduce function is complex.
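A minimal sketch of what "passing sufficient information through" can look like (key names here are illustrative): the reduce phase returns per-session partial sums, and the rereduce phase merges those maps. This also shows the cost Brian warns about, since the returned object grows with the number of distinct sessions.

```javascript
// Reduce returns a map keyed by session so rereduce can keep merging.
// The object grows with the number of distinct sessions -- exactly
// the "large object" trade-off described above.
function reduce(keys, values, rereduce) {
  var acc = {};
  for (var i = 0; i < values.length; i++) {
    var partial;
    if (rereduce) {
      partial = values[i];  // already a {session: {count}} map
    } else {
      partial = {};
      partial[values[i].session] = {count: values[i].count};
    }
    for (var s in partial) {
      if (!acc[s]) acc[s] = {count: 0};
      acc[s].count += partial[s].count;
    }
  }
  return acc;
}

var rows = [
  {session: "s1", count: 2},
  {session: "s1", count: 3},
  {session: "s2", count: 5}
];
var p1 = reduce(null, rows.slice(0, 2), false); // {s1: {count: 5}}
var p2 = reduce(null, rows.slice(2), false);    // {s2: {count: 5}}
var merged = reduce(null, [p1, p2], true);      // both sessions intact
```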

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by hhsuper <hh...@gmail.com>.
Hi Brian, I see what you mean. Even so, we can still use a reduce function,
which requires us to implement rereduce correctly to process these grouped
documents and get the right result. Is that right, Brian?

On Tue, Jun 23, 2009 at 3:05 AM, Brian Candler <B....@pobox.com> wrote:

> On Mon, Jun 22, 2009 at 09:49:15PM +0800, hhsuper wrote:
> > Brian's description of the reduce function is clear, but I think you
> > can achieve your goal as below:
> >
> > function(doc){
> >   emit(doc.group_key, doc)
> > }
> >
> > function(keys,values,rereduce){
> >   //...
> > }
> >
> > with the group=true option you can implement per-group_key document
> > processing in the reduce function; as in your example, reduce will be
> > invoked two times with values: [firstdoc,seconddoc] for groupkey=1 and
> > [thirddoc,fourthdoc] for groupkey=2. Is that enough?
>
> No, what I'm saying is it will *not necessarily* be invoked that way. If
> there are only a handful of documents with the same group key then it is
> quite likely to be invoked that way. However if there are dozens or
> hundreds
> of documents with the same group key, it could be invoked like this:
>
>  reduce(first bunch of documents)    => R1
>  reduce(second bunch of documents)   => R2
>  rereduce(R1,R2)
>



-- 
Yours sincerely

Jack Su

Re: 'Grouping' documents so that a set of documents is passed to the view function

Posted by Brian Candler <B....@pobox.com>.
On Mon, Jun 22, 2009 at 09:49:15PM +0800, hhsuper wrote:
> Brian's description of the reduce function is clear, but I think you can
> achieve your goal as below:
> 
> function(doc){
>   emit(doc.group_key, doc)
> }
> 
> function(keys,values,rereduce){
>   //...
> }
> 
> with the group=true option you can implement per-group_key document
> processing in the reduce function; as in your example, reduce will be
> invoked two times with values: [firstdoc,seconddoc] for groupkey=1 and
> [thirddoc,fourthdoc] for groupkey=2. Is that enough?

No, what I'm saying is it will *not necessarily* be invoked that way. If
there are only a handful of documents with the same group key then it is
quite likely to be invoked that way. However if there are dozens or hundreds
of documents with the same group key, it could be invoked like this:

  reduce(first bunch of documents)    => R1
  reduce(second bunch of documents)   => R2
  rereduce(R1,R2)
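The split above can be simulated directly. As an illustrative sketch, a plain sum is safe under either invocation pattern, which is exactly the property a "process my group_key all at once" reduce cannot rely on:

```javascript
// Simulate the two invocation patterns: one big reduce call versus
// two reduce calls whose results are recombined by a rereduce call.
function reduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}

var docs = [1, 2, 3, 4];
var direct = reduce(null, docs, false);          // 10

var R1 = reduce(null, docs.slice(0, 2), false);  // 3
var R2 = reduce(null, docs.slice(2), false);     // 7
var recombined = reduce(null, [R1, R2], true);   // 10 -- same either way
// A reduce that assumed "all docs for one group_key arrive together"
// has no such guarantee: they may be split across R1 and R2.
```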