Posted to user@couchdb.apache.org by Jim Klo <ji...@sri.com> on 2011/05/20 04:41:06 UTC

Reduce function to perform a Union?

I'm a little dumbfounded by reduce functions.

What I'm trying to do is take a view that has heterogeneous values and union them into a single object; logically this seems like something the reduce function should be capable of doing, but I keep getting the reduce overflow error, even though I'm effectively reducing the view by 50%.

Consider this simplistic scenario:

doc A: { _id : "abc123", type:"resource", keyword:"nasa" }
doc B: { _id : "abc123-timestamp", type: "timestamp", timestamp: "2011/05/19T12:00:00.0000Z", ref_doc: "abc123" }
doc N: ....

Doc A is the original doc... Doc B is the timestamp doc referencing Doc A via ref_doc field... Doc N is just another doc also referencing Doc A via ref_doc field.

I can create a view that essentially looks like:

Key				Value
------------		------------------
"abc123"		{ .... doc A object .... }
"abc123"		{ .... doc B object .... }
"abc123"		{ .... doc N object .... }

I would expect I could build a reduced view that looks something like this:

Key				Value
------------		------------------
"abc123"		{ .... merged doc .... }

Ultimately this goes back to an issue we have where we need the node local timestamp of a document, without generating an event that would cause an update to doc A, causing it to get replicated. We figure we can store local data like a timestamp then join it back with the original doc via a view & list.

Is there something magical about reduce that's not well documented? Or is there maybe a better way to do this?  I know about using linked docs, where in my map function I can reference the _id of the linked document in the emitted value to get a 1-to-1 merge with include_docs=true, but I don't think I can do that with N docs; or can I?

Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International





Re: Reduce function to perform a Union?

Posted by Kinley Dorji <ki...@gmail.com>.
In your example, you want to merge together the different objects
corresponding to key abc123. I'm not sure how you wrote your reduce
code, but I would suggest that you probably need something like this
in it somewhere:

function(key, values, rereduce) {
  var obj = {};
  // values holds every emitted value for the current key group; on
  // rereduce it holds partially merged objects, which merge the same way
  for (var i = 0; i < values.length; i++) {
    for (var item in values[i]) {
      obj[item] = values[i][item];
    }
  }
  return obj;
}

In the above code, the output of your view's map function is taken as
the input, and for every key group (e.g. "abc123") it iterates through
each row, merges the fields however you specify, and returns the
resulting object. The main takeaway is this: values.length is the
number of rows that match the key group being handled, and the loop
'for (var item in values[i])' is what lets you access the values of
each row i in the group of matching rows. Of course, you'd need to
modify this loop and the assignment of values to obj (in my example)
to suit your needs.

Again, I'm pretty sure this code would work, but I haven't really
tried it for large datasets, so I'd be interested to know if it does
scale.
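
One caveat, for what it's worth: CouchDB checks that reduce output
stays small relative to its input, and merging whole documents is
exactly the kind of non-shrinking output that trips the reduce
overflow error from the original message. For experimenting, the check
can be relaxed in the server config, and the view should be queried
with group=true so the reduce runs once per key (the design doc and
view names here are invented):

[query_server_config]
reduce_limit = false

GET /db/_design/merge/_view/by_ref?group=true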

FWIW.

You will probably need to write your own JavaScript function in any case.
On Fri, May 20, 2011 at 8:41 AM, Jim Klo <ji...@sri.com> wrote:
> <snip quoted message>
>

Re: Reduce function to perform a Union?

Posted by Jim Klo <ji...@sri.com>.
Response below:


On May 23, 2011, at 4:25 PM, Randall Leeds <ra...@gmail.com> wrote:

> <snip quoted message>
>
> Thanks for the clear explanation. I'd love to try to help you out here.
> 
> In general I'm totally a fan of the immutable document/"sidecar" approach.
> I can't think of a good way to solve all your problems easily yet, but I'll keep ruminating. Anyway, this is what makes it all fun and worthwhile, right?
> 
> Listening on /_changes and inserting a timestamp document in response to receiving replicated documents lets you, like you said, have an index with the consistency guarantees you want.
> 
> Provided you're only searching for one "color" at a time (and not trying to do arbitrary intersections/unions of tags or something), geocouch might give you the queries you need.
> 
> For example:
> - Listen on /_changes and when you receive a document (from replication or a user write), PUT a sidecar document into the database with the color field from the original document.
> - Then, using geocouch, build a spatial view that emits [timestamp, color] as point geometry for these documents.
> - Finding a time range of orange documents becomes a bounding box query in this setup.
> - Naturally, take care to use filtering so sidecar documents don't replicate.
> - Again, unions of colors won't work like this, so maybe that's a non-starter.
> 

I had considered the geocouch approach, but wasn't sure if I could model it that way. I assumed you had to use point geometry directly - I'm only using color as an example (which can be plotted numerically). In reality the filtering key is most likely a crowdsourced value like a keyword or schema name.

That still doesn't solve field replication within the "sidecar" doc. Eventually, as things progress, I can see the sidecar doc becoming a complete copy plus a timestamp, creating a wasted-space situation.


> Sound like this is heading in the right direction?
> 
> While I know it's generally advised to make views deterministic, and I might get beaten with a stick for saying this, you *could* generate the node-local timestamp in the map function using the current time...
> 

I might be the one that does that! ;-) I'm not sure it works reliably though - I'm assuming you mean just requesting the view on the _changes event and using something like new Date().toString()? It would work as long as the view never gets rebuilt, which I'm not sure we could guarantee long term.

> Anyway, I'm super interested in your use case and I'd like to help you solve this, so keep me in the loop!

Thanks! I'm curious how many others have a similar use case. It doesn't seem like it should be uncommon - but maybe the only users of replicated CouchDB data so far have no flow-consistency needs (which, for those who might think I'm deranged and confused, is different from the eventual consistency model replication follows). I keep wondering: how tough would it really be to expose the local sequence value to a view? Technically it seems like it's already being done with _changes somehow; it just needs to be accessible from map/reduce/list/show. Having this sort of feature seems like it would open Couch up to a whole other class of applications.
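
(As a sketch of what that hook might look like: if the design-document local_seq option is available in the version deployed - worth verifying - each doc gets a _local_seq field during indexing that a view could key on directly. The design doc name here is invented:)

{
  "_id": "_design/localseq",
  "options": { "local_seq": true },
  "views": {
    "by_seq": {
      "map": "function(doc) { if (doc._local_seq) { emit(doc._local_seq, null); } }"
    }
  }
}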

> 
> Regards,
> Randall

Re: Reduce function to perform a Union?

Posted by Randall Leeds <ra...@gmail.com>.
On Fri, May 20, 2011 at 11:11, Jim Klo <ji...@sri.com> wrote:

> <snip quoted message>
>

Thanks for the clear explanation. I'd love to try to help you out here.

In general I'm totally a fan of the immutable document/"sidecar" approach.
I can't think of a good way to solve all your problems easily yet, but I'll
keep ruminating. Anyway, this is what makes it all fun and worthwhile,
right?

Listening on /_changes and inserting a timestamp document in response to
receiving replicated documents lets you, like you said, have an index with
the consistency guarantees you want.
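
(To make the listener concrete, a minimal sketch of that loop using curl
against a local CouchDB; the database name, doc ids, and sidecar fields
are invented for illustration:)

curl 'http://localhost:5984/db/_changes?feed=continuous&include_docs=true'

# ...then, for each change line received, write the sidecar doc with a
# node-local timestamp:
curl -X PUT 'http://localhost:5984/db/abc123-timestamp' \
  -d '{"doc_type": "resource_data_timestamp", "resource_doc_id": "abc123", "color": "orange", "node_timestamp": "2011-05-19T22:09:55Z"}'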

Provided you're only searching for one "color" at a time (and not trying to
do arbitrary intersections/unions of tags or something), geocouch might give
you the queries you need.

For example:
- Listen on /_changes and when you receive a document (from replication or a user write), PUT a sidecar document into the database with the color field from the original document.
- Then, using geocouch, build a spatial view that emits [timestamp, color] as point geometry for these documents.
- Finding a time range of orange documents becomes a bounding box query in this setup.
- Naturally, take care to use filtering so sidecar documents don't replicate.
- Again, unions of colors won't work like this, so maybe that's a non-starter.
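
(A rough sketch of what that spatial function might look like - assuming
geocouch's GeoJSON emit, that the view server's Date.parse handles the ISO
timestamps, and an invented color-to-number mapping:)

function(doc) {
  // hypothetical mapping from color name to a plottable number
  var colorCode = { "orange": 1, "blue": 2, "green": 3 };
  if (doc.doc_type == "sidecar" && doc.node_timestamp && colorCode[doc.color]) {
    // x = node-local time as epoch seconds, y = color code
    var t = Date.parse(doc.node_timestamp) / 1000;
    emit({ type: "Point", coordinates: [t, colorCode[doc.color]] }, null);
  }
}

A time range of orange documents is then a query like ?bbox=t1,1,t2,1
against the spatial view.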

Sound like this is heading in the right direction?

While I know it's generally advised to make views deterministic, and I might
get beaten with a stick for saying this, you *could* generate the node-local
timestamp in the map function using the current time...
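
(Spelled out, the heresy would be something like this:)

function(doc) {
  // deliberately non-deterministic: the key records when the row was
  // indexed, not when the doc was written, and changes if the view rebuilds
  emit(new Date().getTime(), null);
}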

Anyway, I'm super interested in your use case and I'd like to help you solve
this, so keep me in the loop!

Regards,
Randall

Re: Reduce function to perform a Union?

Posted by Jim Klo <ji...@sri.com>.
I was trying not to have to perform the merge outside of Couch. Ultimately, the problem we are trying to solve is supporting time-based range queries and flow control over a contiguous range for OAI-PMH harvesting, which requires consistency in range requests.

Let me describe our application a bit. The attached image illustrates the system design, where the light blue circles are instances of the application services. Nodes are connected to each other in an ad-hoc manner.  Some nodes may communicate bi-directionally, some may not. Our aim is to use filtered replication via Couch to distribute documents. At each node, CouchDB is used to store the JSON documents. Since the system is constantly receiving new content, storing a timestamp with the document is pointless, considering that there's no way to guarantee the consistency of the global content set at any single point in time.  Updating the JSON doc with a new timestamp at insert/update (replication) would just cause the document to replicate again - causing a cascade effect. Currently, all documents inserted into Couch are considered immutable in our design.


Consider timestamped documents being replicated to a node from another node, using just the date for simplicity, as in the attached image. We're looking at the replication timeline on a single node, where the content can change at any time.

If at 3:00 I query Node X for all the Orange documents between 4/1 and 4/4, I get one object back.  The result changes if at 4:00 I query Node X again for all Orange documents between 4/1 and 4/4. The OAI-PMH protocol requires that the result remain consistent.

Ideally, I need to be able to execute a query at, say, 3:00 and again at 6:00 asking for all the Orange documents received between 1:00 and 3:00. Regardless of when I ask, I should always get the same stream of documents received within a specific time slice, matching some specific criteria.

Effectively, I need to be able to create a view combining the local sequence of the document with other document traits (keywords, publisher, etc.).  If there's a way to do this, then my problem is solved.  I haven't found this ability in Couch; hence, what we've done is essentially this: on replication/publish, a change listener inserts a 'sidecar document' containing a local timestamp (sequence), which we can then use with document linking or view collation w/ a reduce to figure out the right result set, with a minimal number of transforms, to eventually return to the consumer.

Using a document linking method like:

function(doc) {
	if (doc.doc_type == "resource_data_timestamp" && doc.node_timestamp) {
		emit(doc.node_timestamp, { "_id": doc.resource_doc_id } );
	}
}

where the resource_data_timestamp doc looks something like:

{
   "_id": "d2018b3fe169426b95e44a5580692d5a-timestamp",
   "_rev": "1-719655db4a1df9b9efcc5edbd62289ed",
   "doc_type": "resource_data_timestamp",
   "resource_doc_id": "d2018b3fe169426b95e44a5580692d5a",
   "doc_version": "0.20.0",
   "node_timestamp": "2011-05-19T22:09:55.704004Z"
}

we can perform a range query against the view by timestamp only and get the original doc via include_docs=true, but we would then have to table-scan to filter out documents that aren't "orange", which could be millions of records in our use case. This method, though, only lets me join 2 documents together.... We're also trying to determine a method of handling a delete policy, potentially using a tombstone document as well - which is where the collated key with map and reduce comes into play, since there could be more than one type of "sidecar" document.
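
(For concreteness, the timestamp-only range query described above would look something like this - the design doc and view names are invented:)

GET /db/_design/harvest/_view/by_node_timestamp?startkey="2011-05-19T00:00:00Z"&endkey="2011-05-20T00:00:00Z"&include_docs=true

Since each row's value is {"_id": ...}, include_docs=true pulls in the linked resource doc rather than the sidecar doc.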

Is this making sense?  Really, we're looking for any solution with which we can ensure a consistent range result with some additional filtering, that's relatively practical and, most importantly, can scale.

Any advice would be great.


Jim Klo
Senior Software Engineer
Center for Software Engineering
SRI International




On May 20, 2011, at 6:59 AM, Stephen Prater wrote:

> Why do you need to reduce those docs?  In this particular example, you can do a range query on the key and get (basically) the same results.
> 
> Also, I think the allowed growth rate for reduce output is log(rows) - so reducing the view size by 50% is still going to run up against the limit.
> 
> On May 19, 2011, at 9:41 PM, Jim Klo wrote:
> 
>> <snip quoted message>
>


Re: Reduce function to perform a Union?

Posted by Stephen Prater <st...@agrussell.com>.
Why do you need to reduce those docs?  In this particular example, you
can do a range query on the key and get (basically) the same results.

Also, I think the allowed growth rate for reduce output is log(rows) -
so reducing the view size by 50% is still going to run up against the
limit.
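
(Concretely, a key query against the map-only view from the original
message returns all the rows for one document id in a single request,
no reduce needed - the design doc and view names are invented:)

GET /db/_design/merge/_view/by_ref?key="abc123"

The client can then merge the handful of rows per key itself, which
sidesteps the reduce limit entirely.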

On May 19, 2011, at 9:41 PM, Jim Klo wrote:

> <snip quoted message>
>