Posted to user@couchdb.apache.org by Tom Wright <to...@inflatablecookie.com> on 2008/08/04 20:31:37 UTC

Optionally including docs in view results

Hi all.

I'm currently building a search engine with couch. The search criteria
are particularly complicated, so a view is generated at search time, with
the bare minimum information emitted (if it hasn't been indexed already).
This all works fine, but displaying the results introduces a
seemingly unnecessary overhead.

Because there will potentially be a lot of views, emitting the whole 
document isn't really an option as way too much disk space will be 
wasted. Currently, the only other option is to iterate through the view 
results and query each hit individually for the full document body. As 
I'm working in PHP, there's a fairly substantial crunch involved in 
querying a large number of single records in a loop.

Is there a way (or is a way planned) for the documents to be optionally 
attached to a view's result rows via a query string argument?

Example:

/my_db/_view/my_design_doc/my_view?include_docs=true

produces

"rows":
[{
"id": "id",
"key": "key",
"value": { "my_emitted_data": 123 },
"doc": { -- document data here -- }
},
{
etc..
}]
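
To make the trade-off concrete, here is a minimal Python sketch of the two access patterns. It only builds request paths and makes no network calls. The include_docs argument is the one proposed above, not something confirmed to exist at this point; the helper names and the sample rows are illustrative assumptions.

```python
from urllib.parse import quote, urlencode


def view_url(db, design_doc, view, include_docs=False):
    """Build the view path; include_docs is the *proposed* query argument."""
    url = "/%s/_view/%s/%s" % (db, design_doc, view)
    if include_docs:
        url += "?" + urlencode({"include_docs": "true"})
    return url


def per_document_urls(db, rows):
    """The current workaround: one extra GET per view row."""
    return ["/%s/%s" % (db, quote(row["id"], safe="")) for row in rows]


# Hypothetical view rows, shaped like the example above.
rows = [{"id": "a", "key": "k1"}, {"id": "b", "key": "k2"}]
print(view_url("my_db", "my_design_doc", "my_view", include_docs=True))
print(per_document_urls("my_db", rows))
```

With N rows the workaround costs N extra requests, whereas the proposed argument would keep it to one.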

I know this is a relative edge case, but it'd be really handy to have!
Any thoughts / suggestions?

Cheers
Tom


Re: Optionally including docs in view results

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Aug 5, 2008 at 3:06 PM, Chris Anderson <jc...@grabb.it> wrote:
> On Tue, Aug 5, 2008 at 11:26 AM, Paul Davis <pa...@gmail.com> wrote:
>> There was a similar idea posted on another thread that I really liked,
>> looked at implementing and got scared of because of the mentioned
>> replication stuff in bulk docs.
>>
>> The basic idea was to be able to post something like the following to _bulk_docs
>>
>> {
>>  "put": {doc1, doc2, doc3}
>>  "delete": {doc4, doc5}
>>  "get": {doc6}
>> }
>
> This idea has its upsides, but I'm wary of breaking
> backwards-compatibility with _bulk_docs.
>

I feel that I should point out that couch is at 0.9, I don't think
it'd be too big of a deal to break compatibility. Especially if it'd
benefit a large segment of users.

> However, I don't see anything wrong with the way _bulk_docs currently
> handles create/update/delete (even if it isn't RESTful).
>

I've never been accused of being a REST purist. :D Is this violating
REST because a resource is exposed at more than one URI? If so, phooey to
that. If I'm missing some finer point then I wouldn't mind being
enlightened.

> On IRC there have been some good arguments in favor of an include_docs
> option for view, which could be used with _all_docs and multi-key view
> requests to yield the _load_docs (fetch docs in bulk) feature.
>
> Also with that design Tom's original question on this thread would be
> a simple case.
>
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

I'm kind of partial to the _transaction method, but I'm not against
an include_docs either. But I get the feeling that we're going to run
into some implementation issues somewhere. I haven't completely grokked
couch's concurrency model and I wonder what Damien would say about
fetching docs after passing over a view (I'm pretty sure I heard him
talk once about taking movement across spindles into account, which is
way beyond the level at which I think of things).

Re: Optionally including docs in view results

Posted by Michael Hendricks <mi...@ndrix.org>.
On Wed, Aug 06, 2008 at 12:23:51PM +0100, Tom Wright wrote:
> Having had a bit of time to stew on this issue, I'm coming to the
> conclusion that there needs to be more of a separation of concerns.
> Fetching records should not be lumped into the same call as updating /
> inserting / deleting, i.e. _bulk_docs. However, I reckon _bulk_docs as it
> is, is trying to do too much. Why not go for something like the following:
>
> _bulk_get
> _bulk_insert
> _bulk_update
> _bulk_delete
>
> This way, a request will only ever have one purpose, and the response can 
> be simpler and more relevant, especially where errors are concerned. Maybe 
> change insert / update for put / post in terms of naming, but you get the 
> idea.

Allowing insert, update and delete operations in a single request is
quite useful though.  It allows for a transaction so that I can insert a
new document and update another document to point to it and be assured
that the database's integrity is maintained.  If one part fails, the
whole thing fails.
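
As a rough illustration of the all-or-nothing behaviour described here, the following Python sketch stages changes on a copy of an in-memory store and only commits if every document is acceptable. It is an assumption-laden toy (the function name, the dict-based store and the "must have an _id" rule are all made up for the example) and says nothing about how CouchDB implements _bulk_docs internally.

```python
def apply_all_or_nothing(store, docs):
    """Apply every doc to `store`, or leave `store` untouched on any failure."""
    staged = dict(store)  # work on a copy so a failure changes nothing
    for doc in docs:
        if "_id" not in doc:
            raise ValueError("every doc in this sketch needs an _id")
        staged[doc["_id"]] = dict(doc)
    # Commit only after all docs were staged successfully.
    store.clear()
    store.update(staged)


store = {"post1": {"_id": "post1", "comments": []}}
try:
    # The second doc is invalid, so the first must not be applied either.
    apply_all_or_nothing(store, [{"_id": "comment1"}, {"text": "no id"}])
except ValueError:
    pass
print(sorted(store))
```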

-- 
Michael

Re: Optionally including docs in view results

Posted by Darren New <dn...@san.rr.com>.
Tom Wright wrote:
> _bulk_get
> _bulk_insert
> _bulk_update
> _bulk_delete

It might be useful, also, to think in terms of a meta-level of naming, 
even inside the JSON. For example, if documents are modified with a PUT, 
then rather than having _bulk_docs take an array with a tag of "docs" on 
it, have _bulk_docs take an array with a tag of "PUT" on it. In other 
words, have a one-to-one mapping from the RESTful interface to the 
_bulk* interface(s), rather than having each operation be ad hoc.

I've found in the past this makes things (a) much easier to describe and 
learn and (b) much easier to extend.
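
A small Python sketch of what such a one-to-one mapping could look like: each verb-tagged entry in the bulk body expands to the single REST request it stands for. The verb-tagged body format and the function name are hypothetical, taken from the proposals in this thread rather than from any existing CouchDB API.

```python
def expand_verb_tagged(db, body):
    """Expand a verb-tagged bulk body into (verb, path) pairs."""
    requests = []
    for verb in ("PUT", "POST", "DELETE", "GET"):
        for doc in body.get(verb, []):
            if verb == "POST":
                # POST creates a doc without an _id, so it targets the database.
                path = "/%s" % db
            else:
                path = "/%s/%s" % (db, doc["_id"])
            requests.append((verb, path))
    return requests


body = {
    "PUT": [{"_id": "doc1", "foo": "bar"}],
    "POST": [{"baz": "foo"}],
    "DELETE": [{"_id": "doc2"}],
    "GET": [{"_id": "doc3"}],
}
for verb, path in expand_verb_tagged("my_db", body):
    print(verb, path)
```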

Something like setting the deleted flag=true to delete it in bulk docs 
and using a different verb in REST is the sort of thing that leads to 
needing _tag syntax in the first place.

Just a thought on the design.

-- 
Darren New / San Diego, CA, USA (PST)
  Ever notice how people in a zombie movie never already know how to
  kill zombies? Ask 100 random people in America how to kill someone
  who has reanimated from the dead in a secret viral weapons lab,
  and how many do you think already know you need a head-shot?

Re: Optionally including docs in view results

Posted by Tom Wright <to...@inflatablecookie.com>.
I agree, to an extent.

Having had a bit of time to stew on this issue, I'm coming to the
conclusion that there needs to be more of a separation of concerns.
Fetching records should not be lumped into the same call as updating /
inserting / deleting, i.e. _bulk_docs. However, I reckon _bulk_docs as
it is, is trying to do too much. Why not go for something like the
following:

_bulk_get
_bulk_insert
_bulk_update
_bulk_delete

This way, a request will only ever have one purpose, and the response 
can be simpler and more relevant, especially where errors are concerned. 
Maybe change insert / update for put / post in terms of naming, but you 
get the idea.


Re: Optionally including docs in view results

Posted by Troy Kruthoff <tk...@blit.com>.
Getting the API as future-proof as possible before the 1.0 release is more
important than backwards compatibility.  The main reason I am a
proponent of a bulk_docs api supporting the REST verbs is that, IMHO, it is
easier to use and understand than having multiple API end-points to
accomplish RESTful access in a performant manner.  Couch is marketing
itself (among other things) as a RESTful database for web apps, which
is an obvious buzzword and one that I believe will aid in the
adoption of the technology, but the fact is we are having this
discussion because we need more performance than what the RESTful
access is bringing to the table.

So, to the web developer we can say:

"When you only need to access 1-5 documents for your web page, use the
REST api.  If you need to get 100 documents, modify them and save
them back, then use the bulk_load api to get the documents and the
bulk_docs api to save them, because we are the RESTful database built
for tomorrow's web apps"

The message that I believe is more marketable, easier to understand  
and use is that you can wrap multiple REST calls into a single request  
to the server by POSTing your requests as a JSON payload to the  
bulk_docs uri when you want to achieve maximum performance (or  
_transaction, or bulk_rest, or any other name).

-- troy






Re: Optionally including docs in view results

Posted by Chris Anderson <jc...@grabb.it>.
On Tue, Aug 5, 2008 at 11:26 AM, Paul Davis <pa...@gmail.com> wrote:
> There was a similar idea posted on another thread that I really liked,
> looked at implementing and got scared of because of the mentioned
> replication stuff in bulk docs.
>
> The basic idea was to be able to post something like the following to _bulk_docs
>
> {
>  "put": {doc1, doc2, doc3}
>  "delete": {doc4, doc5}
>  "get": {doc6}
> }

This idea has its upsides, but I'm wary of breaking
backwards-compatibility with _bulk_docs.

However, I don't see anything wrong with the way _bulk_docs currently
handles create/update/delete (even if it isn't RESTful).

On IRC there have been some good arguments in favor of an include_docs
option for view, which could be used with _all_docs and multi-key view
requests to yield the _load_docs (fetch docs in bulk) feature.

Also with that design Tom's original question on this thread would be
a simple case.


-- 
Chris Anderson
http://jchris.mfdz.com

Re: Optionally including docs in view results

Posted by Paul Davis <pa...@gmail.com>.
There was a similar idea posted on another thread that I really liked,
looked at implementing and got scared of because of the mentioned
replication stuff in bulk docs.

The basic idea was to be able to post something like the following to _bulk_docs

{
  "put": {doc1, doc2, doc3}
  "delete": {doc4, doc5}
  "get": {doc6}
}

(Originally there was a post section too, but there's been talk on how
that's kinda shaky, so not sure about it right now)

And then it'd return something like:

{
   "put": { [doc1._id, doc1._rev, ...]}
   "get": { doc6 }
}

(Obviously that's all horribly invalid JSON, but hopefully it gets the
point across)

The idea is that updates/deletes/gets are processed in probably that
order. So if you request a doc you just deleted, you'll get a deleted
doc. And if you delete a doc you just updated, then you just wasted
everyone's time.
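
The ordering can be sketched with a toy in-memory store in Python. Everything here is an assumption for illustration (the function name, the lower-case "put"/"delete"/"get" members taking plain lists, and representing a deleted doc as a stub with _deleted set); it is not the real _bulk_docs behaviour.

```python
def run_transaction(store, request):
    """Apply puts, then deletes, then gets against an in-memory dict."""
    result = {"put": [], "get": []}
    for doc in request.get("put", []):
        store[doc["_id"]] = dict(doc)
        result["put"].append(doc["_id"])
    for doc_id in request.get("delete", []):
        if doc_id in store:
            # Keep a stub so a later get can report the deletion.
            store[doc_id] = {"_id": doc_id, "_deleted": True}
    for doc_id in request.get("get", []):
        result["get"].append(store.get(doc_id))
    return result


store = {"doc4": {"_id": "doc4", "n": 1}}
out = run_transaction(store, {
    "put": [{"_id": "doc1", "n": 2}],
    "delete": ["doc4"],
    "get": ["doc4", "doc1"],
})
print(out["get"])
```

Because gets run last, asking for doc4 in the same request that deletes it yields the deleted stub rather than the old body.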

A thought that just occurred: would it help at all to rename
_bulk_docs to _transaction? That sounds a bit better to me. Also,
sitting on the outside with no Erlang fu, it seems kind of weird that
replication is using special additions to bulk_docs parameters. Unless
of course Damien yells at me and says those are just undocumented features
that might have other uses :D

And I'm spent.


On Tue, Aug 5, 2008 at 12:30 PM, Chris Anderson <jc...@grabb.it> wrote:
> On Tue, Aug 5, 2008 at 9:02 AM, Troy Kruthoff <tk...@blit.com> wrote:
>> +1 for the bulk_docs api, specifically being able to perform multiple
>> REST-type operations in a single request should be "bulk_docs" by
>> definition.
>>
>
> Part of the complexity comes from the fact that _bulk_docs is already
> used in replication, so it has some options that aren't normally used
> in a standard POST from an application.
>
> Currently with _bulk_docs you can do the equivalent of PUT
> (create/update with _id), POST (create without _id), and DELETE
> (update set _deleted=true). So most of the RESTful options are
> available, they just aren't segregated by verb.
>
> You can't do GET yet - to me it just seems confusing to allow both
> modification of documents and fetching of documents in the same http
> request.
>
> If I'm wrong, and we really should include document fetches in
> _bulk_docs, it won't be impossible to code, but we should make sure
> the feature is clear.
>
> The POST body can certainly accommodate both "docs" and "ids" members,
> it's just a matter of how to structure the response so that it's clear
> which parts of the response come from which CouchDB actions. (Not to
> mention all the edge-cases around documents that are both updated by
> the "docs" member, and requested by the "ids" member, or maybe the
> update phase succeeds but one of the request docs is 404.)
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: Optionally including docs in view results

Posted by Chris Anderson <jc...@grabb.it>.
On Tue, Aug 5, 2008 at 9:02 AM, Troy Kruthoff <tk...@blit.com> wrote:
> +1 for the bulk_docs api, specifically being able to perform multiple
> REST-type operations in a single request should be "bulk_docs" by
> definition.
>

Part of the complexity comes from the fact that _bulk_docs is already
used in replication, so it has some options that aren't normally used
in a standard POST from an application.

Currently with _bulk_docs you can do the equivalent of PUT
(create/update with _id), POST (create without _id), and DELETE
(update set _deleted=true). So most of the RESTful options are
available, they just aren't segregated by verb.
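
For illustration, a Python sketch that assembles one _bulk_docs body covering all three cases. The "docs" member and the _deleted flag come from the description above; the helper name and its arguments are made up for the example, and a real delete would likely also need the document's current revision, which is left out here.

```python
import json


def bulk_docs_body(updates=(), creates=(), deletes=()):
    """Combine PUT-like, POST-like and DELETE-like entries under "docs"."""
    docs = []
    for doc in updates:
        # PUT equivalent: create/update with an explicit _id.
        if "_id" not in doc:
            raise ValueError("an update needs an _id")
        docs.append(dict(doc))
    for doc in creates:
        # POST equivalent: create without an _id.
        docs.append({k: v for k, v in doc.items() if k != "_id"})
    for doc in deletes:
        # DELETE equivalent: an update that sets _deleted to true.
        stub = dict(doc)
        stub["_deleted"] = True
        docs.append(stub)
    return {"docs": docs}


body = bulk_docs_body(
    updates=[{"_id": "doc1", "foo": "bar"}],
    creates=[{"baz": "foo"}],
    deletes=[{"_id": "doc2"}],
)
print(json.dumps(body, sort_keys=True))
```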

You can't do GET yet - to me it just seems confusing to allow both
modification of documents and fetching of documents in the same http
request.

If I'm wrong, and we really should include document fetches in
_bulk_docs, it won't be impossible to code, but we should make sure
the feature is clear.

The POST body can certainly accommodate both "docs" and "ids" members,
it's just a matter of how to structure the response so that it's clear
which parts of the response come from which CouchDB actions. (Not to
mention all the edge-cases around documents that are both updated by
the "docs" member, and requested by the "ids" member, or maybe the
update phase succeeds but one of the request docs is 404.)

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Optionally including docs in view results

Posted by Troy Kruthoff <tk...@blit.com>.
+1 for the bulk_docs api, specifically being able to perform multiple  
REST-type operations in a single request should be "bulk_docs" by  
definition.

-- troy


On Aug 4, 2008, at 2:47 PM, Chris Anderson wrote:

> On Mon, Aug 4, 2008 at 2:42 PM, Dean Landolt <de...@deanlandolt.com>  
> wrote:
>>
>> {
>> "put": [ { "_id": "doc1", "foo": "bar" } ],
>> "post": [ { "baz": "foo" } ],
>> "delete": [ {"_id": "doc2" } ],
>> "get": [ { "_id": "doc3" }, { "_id": "doc4"} ]
>> }
>>
>
> I bet I can do something like this without breaking existing clients:
>
> # current functionality
>
> {
> "docs":[array of docs]
> }
>
> # bulk load overload
>
> {
> "ids":[array of ids]
> }
>
> I think it would be much too complex to allow the same POST to both
> create some docs and request others. I can modify my patch so that it
> works like this (and ignores the "ids" member if "docs" is set). I
> don't want to waste time, so I'll wait until we have a consensus that
> overloading _bulk_docs is better than _load_docs.
>
> Do any committers have feedback?
>
>
> -- 
> Chris Anderson
> http://jchris.mfdz.com


Re: Optionally including docs in view results

Posted by Chris Anderson <jc...@grabb.it>.
On Mon, Aug 4, 2008 at 2:42 PM, Dean Landolt <de...@deanlandolt.com> wrote:
>
> {
> "put": [ { "_id": "doc1", "foo": "bar" } ],
> "post": [ { "baz": "foo" } ],
> "delete": [ {"_id": "doc2" } ],
> "get": [ { "_id": "doc3" }, { "_id": "doc4"} ]
> }
>

I bet I can do something like this without breaking existing clients:

# current functionality

{
"docs":[array of docs]
}

# bulk load overload

{
"ids":[array of ids]
}

I think it would be much too complex to allow the same POST to both
create some docs and request others. I can modify my patch so that it
works like this (and ignores the "ids" member if "docs" is set). I
don't want to waste time, so I'll wait until we have a consensus that
overloading _bulk_docs is better than _load_docs.
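
In Python terms, the dispatch rule I have in mind is roughly the following. This is only a sketch of the rule as described (the function name and return shape are made up, and the actual patch would be in Erlang):

```python
def classify_bulk_request(body):
    """Decide what a _bulk_docs POST body is asking for."""
    if "docs" in body:
        # Current functionality wins; any "ids" member is ignored.
        return ("update", body["docs"])
    if "ids" in body:
        # Bulk load overload.
        return ("load", body["ids"])
    raise ValueError("expected a 'docs' or 'ids' member")


print(classify_bulk_request({"docs": [{"_id": "doc1"}]}))
print(classify_bulk_request({"ids": ["doc3", "doc4"]}))
print(classify_bulk_request({"docs": [], "ids": ["doc3"]}))
```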

Do any committers have feedback?


-- 
Chris Anderson
http://jchris.mfdz.com

Re: Optionally including docs in view results

Posted by Dean Landolt <de...@deanlandolt.com>.
On Mon, Aug 4, 2008 at 5:34 PM, Tom Wright <to...@inflatablecookie.com> wrote:

> I have a suspicion there was a discussion on the list a little while back
> about being able to perform this operation via _bulk_docs, although I don't
> know where off the top of my head, and I have no idea whether that was
> veto'd or not as I think there were a few issues with making it play nice
> with the current functionality.
>
> If _bulk_docs can be made to work with fetching records as well as updating
> them, I think that may be the neater way to go. On the other hand, if it
> can't, then the patch you've posted looks good enough to me, although I'm
> hardly an authority :)


After a little googlin' through my gmail I came up with Paul Davis's post from
a few weeks ago:

There was an idea floated on another thread to make _bulk_docs support
a post body of something like:

{
"put": [ { "_id": "doc1", "foo": "bar" } ],
"post": [ { "baz": "foo" } ],
"delete": [ {"_id": "doc2" } ],
"get": [ { "_id": "doc3" }, { "_id": "doc4"} ]
}

I looked briefly into this, but the code that runs bulk docs is a lot
deeper than I thought it was. There's quite a bit of stuff related to
replication and consistency.

There was also a suggestion in that thread to use a GET to _bulk_docs to
signify multi-key loads but was nixed because of potential url length
limitations. So it looks like unless someone can untangle the _bulk_docs
code (I'm still in the very early stages of learning Erlang and it's slow
going for now, so it certainly won't be me), it looks like a separate
_load_docs endpoint is the way to go.

Re: Optionally including docs in view results

Posted by Tom Wright <to...@inflatablecookie.com>.
I have a suspicion there was a discussion on the list a little while 
back about being able to perform this operation via _bulk_docs, although 
I don't know where off the top of my head, and I have no idea whether 
that was veto'd or not as I think there were a few issues with making it 
play nice with the current functionality.

If _bulk_docs can be made to work with fetching records as well as 
updating them, I think that may be the neater way to go. On the other 
hand, if it can't, then the patch you've posted looks good enough to me, 
although I'm hardly an authority :)




Re: Optionally including docs in view results

Posted by Chris Anderson <jc...@grabb.it>.
On Mon, Aug 4, 2008 at 12:05 PM, Tom Wright <to...@inflatablecookie.com> wrote:
> Yeah, I pretty much came to a similar conclusion in the last 10 minutes or
> so..
> Is the _load_docs API on the roadmap or is it just conjecture at the moment?
>

I've submitted a patch that creates the _load_docs API. I'm not sure
about the names, so feedback is appreciated.

https://issues.apache.org/jira/browse/COUCHDB-98

Chris


-- 
Chris Anderson
http://jchris.mfdz.com

Re: Optionally including docs in view results

Posted by Tom Wright <to...@inflatablecookie.com>.
Yeah, I pretty much came to a similar conclusion in the last 10 minutes 
or so..
Is the _load_docs API on the roadmap or is it just conjecture at the moment?




Re: Optionally including docs in view results

Posted by Dean Landolt <de...@deanlandolt.com>.
+1 for _load_docs :)

On Mon, Aug 4, 2008 at 2:57 PM, Chris Anderson <jc...@grabb.it> wrote:

> On Mon, Aug 4, 2008 at 11:40 AM, Chris Anderson <jc...@grabb.it> wrote:
> > On Mon, Aug 4, 2008 at 11:31 AM, Tom Wright <to...@inflatablecookie.com>
> wrote:
> >>
> >> Is there a way (or is a way planned) for the documents to be optionally
> >> attached to a view's result rows via a query string argument?
> >
> > We were just discussing this on IRC the other day. It should be a
> > relatively simple patch to make.
>
> I thought about this some more, and even started writing some tests
> for it. Now I'm pretty convinced that this is not the way to go.
>
> Because map functions can emit() many times per document, you'd often
> end up transferring much more data than you need to - if a document
> maps to 5 keys in your key range you'll get the same document over the
> wire 5 times.
>
> The better course of action would be to use the (non-existent)
> _load_docs API to load multiple documents. Then you'd do 2 requests
> from PHP - one to load the view results, and one to request the
> (unique) set of documents associated with the view rows.
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>

Re: Optionally including docs in view results

Posted by Chris Anderson <jc...@grabb.it>.
On Mon, Aug 4, 2008 at 11:40 AM, Chris Anderson <jc...@grabb.it> wrote:
> On Mon, Aug 4, 2008 at 11:31 AM, Tom Wright <to...@inflatablecookie.com> wrote:
>>
>> Is there a way (or is a way planned) for the documents to be optionally
>> attached to a view's result rows via a query string argument?
>
> We were just discussing this on IRC the other day. It should be a
> relatively simple patch to make.

I thought about this some more, and even started writing some tests
for it. Now I'm pretty convinced that this is not the way to go.

Because map functions can emit() many times per document, you'd often
end up transferring much more data than you need to - if a document
maps to 5 keys in your key range you'll get the same document over the
wire 5 times.

The better course of action would be to use the (non-existent)
_load_docs API to load multiple documents. Then you'd do 2 requests
from PHP - one to load the view results, and one to request the
(unique) set of documents associated with the view rows.
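
The client-side step between the two requests is just de-duplicating the row ids. A minimal Python sketch (the thread mentions PHP, but the idea is the same; the function name and sample rows are illustrative):

```python
def unique_doc_ids(rows):
    """Return each row's document id once, in first-seen order."""
    seen = set()
    ids = []
    for row in rows:
        doc_id = row["id"]
        if doc_id not in seen:
            seen.add(doc_id)
            ids.append(doc_id)
    return ids


# One document ("a") emitted under three keys, another ("b") under one.
rows = [
    {"id": "a", "key": "x"},
    {"id": "b", "key": "y"},
    {"id": "a", "key": "z"},
    {"id": "a", "key": "zz"},
]
print(unique_doc_ids(rows))
```

The resulting list is what the second request would send, so each document crosses the wire once regardless of how many keys it was emitted under.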

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Optionally including docs in view results

Posted by Chris Anderson <jc...@grabb.it>.
On Mon, Aug 4, 2008 at 11:31 AM, Tom Wright <to...@inflatablecookie.com> wrote:
>
> Is there a way (or is a way planned) for the documents to be optionally
> attached to a view's result rows via a query string argument?

We were just discussing this on IRC the other day. It should be a
relatively simple patch to make.

The only caveat is that it won't work for reduce views (because a
reduced row has no single docid), but on the good side it will give
people even less reason to emit the whole doc to the map index.




-- 
Chris Anderson
http://jchris.mfdz.com