You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by Dean Landolt <de...@deanlandolt.com> on 2008/11/11 20:57:50 UTC

joins, reprise

I know this join discussion has come up again and again, and with multi-key
get it's really not that hard to do a join in two steps. But I was playing
with include_docs and it got me thinking -- couldn't there be a dereference
parameter on views that acted like include docs? So you could pass in
?dereference=tags, for instance, and in the return the view could grab all
tag attributes and dereference them if their value is a string containing an
existing _id. This would be a *real* join, all without any complicated
accounting -- and it would put an end to the whole *but sql can do this*...
whining.

Bonus points for being able to pass in a list of dereference attributes
(?dereference=[tags,comments] or ?dereference=tags&dereference=comments) and
even better, if the given attribute's value is a list, iterating over it and
dereferencing any string containing an existing _id. This would be far nicer
than sql joins -- you would essentially get real many-to-manies sans
join-table awkwardness.

Does this seem logical? Doable?

Re: joins, reprise, and a suggestion for _external enhancement.

Posted by Paul Davis <pa...@gmail.com>.
On Tue, Nov 11, 2008 at 5:41 PM, Antony Blakey <an...@gmail.com> wrote:
>
> On 12/11/2008, at 7:02 AM, Dean Landolt wrote:
>
>> On Tue, Nov 11, 2008 at 3:08 PM, Paul Davis
>> <pa...@gmail.com>wrote:
>>
>>> I think this is an interesting idea, and has mostly been done with
>>> client libraries. ATM, I'm leaning towards saying that this is a
>>> client extension and doesn't really belong in couch. There are a crap
>>> load of optimizations that clients could make that couch couldn't.
>>>
>>> I have some ideas running around in my head about doing object graph
>>> loading etc. Things really start to get fun on the client when you
>>> contemplate referencing other databases etc.
>>>
>>> Anyway, if you can come up with some part of this functionality that
>>> *must* be done server side and has a big enough use case, ideas for
>>> patches are always welcome :D
>>
>>
>> I didn't think of it that way, but I agree. Perhaps a querytools plugin
>> would be in order when the plugin system lands, but this is probably best
>> left to the client. Does anybody know if jquery.couch does something like
>> this? If not, I'll have a go at hacking it in.
>
> I've written a plugin (in Erlang) that allows joins by tracking updates and
> replicating keys to Mnesia. It has the same behaviour as the existing
> map-reduce views e.g. updates only on view request. You can just as easily
> use e.g. SQLite rather than Mnesia.
>
> I did this to allow arbitrary join queries against my model without doing it
> on the client, and to centralise caching for high-performance joins such as
> transitive User->Permission checks when using a User/Role/Permission model
> that doesn't want both relationships to be stored on the Role object.
>
> I can separately compile and deploy this using the simple change to
> bin/couchdb that I described in an earlier post. It involves no change to
> the rest of CouchDB.
>
> *** Suggestion ***
>
> Now that I've done it, I realize that it's overkill, and I'm abandoning that
> approach, not only because very few people are going to want to do this in
> Erlang, but also because the same effect can be gained using _external. The
> single optimization required is for a request coming through _external to
> carry the db seqno for the request (presuming the _external request is
> qualified by a db). This allows you to easily avoid updating your external
> index, without making any request if there have been no changes since the
> last request. This is important because it's a performance hit if you have
> to make an additional HTTP request to CouchDB on every view request.
>
> You should also use both a startkey and an endkey in the _all_docs_by_seq
> request, which in effect gives the same semantics as map-reduce views i.e.
> you don't get into any race conditions during updating because you see one
> particular MVCC snapshot.
> You can synthesize this approach by simultaneously using a notification
> listener and arranging for that process to talk to your _external handler,
> but given that the Erlang endpoint has easy access to the seqno, why not
> supply it to the external process, and avoid the hassle. Listening for
> notifications is necessary if you are going to use something like memcached,
> or some query mechanism that doesn't go through CouchDB (and you want to
> avoid requesting the db seqno on every request).
>

Adding the seq num to the external protocol would be trivial and
you've made more than enough of a case for it IMO.

> Maybe a different external that enforces a db qualifier, such as
> _external_view, would be appropriate. There are further optimisations I
> considered, such as enriching the _external protocol to allow the external
> process to perform the _all_docs_by_seq request (and subsequent document
> GETs) over the port, but on reflection the minimal change is preferable.
>

I contemplated something of this nature too. But it seems like it'd be
a PITA to get right as well as lead to fairly little benefit. In the
future if we start hitting a huge penalty for things like action
scripts getting bogged down in pure connection overhead it might be an
optimization, but for the moment I don't think it should be a concern.

> Finally, by not doing this in Erlang it is easier to conceive of a mechanism
> that allows you to deploy the _external handler via CouchDB itself. Doing
> the plugin in Erlang introduces the Subject/Object problem, which will it
> can be overcome, is a lot more fraught IMO. Paul suggested this for Erlang
> plugins, but I think it's more applicable to this scenario.
>

I'm still grappling over this problem. I'm kinda more or less thinking
that given enough time, the easiest most bestest way to distribute
plugins is going to either be via javascript, or erlang once the
erlang infrastructure is set up. I should note that I see this part of
the infrastructure way down the road. As in, I have a shimmering
vision of something like firefox's add-on support where the program
itself can download and install new plugins etc. Obviously its gonna
be a long ass time before we can tell people to just download the XYZ
Plugin from the Pillow Factory.

> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Did you hear about the Buddhist who refused Novocain during a root canal?
> His goal: transcend dental medication.
>
>
>

Paul

Re: joins, reprise, and a suggestion for _external enhancement.

Posted by Antony Blakey <an...@gmail.com>.
On 12/11/2008, at 7:02 AM, Dean Landolt wrote:

> On Tue, Nov 11, 2008 at 3:08 PM, Paul Davis <paul.joseph.davis@gmail.com 
> >wrote:
>
>> I think this is an interesting idea, and has mostly been done with
>> client libraries. ATM, I'm leaning towards saying that this is a
>> client extension and doesn't really belong in couch. There are a crap
>> load of optimizations that clients could make that couch couldn't.
>>
>> I have some ideas running around in my head about doing object graph
>> loading etc. Things really start to get fun on the client when you
>> contemplate referencing other databases etc.
>>
>> Anyway, if you can come up with some part of this functionality that
>> *must* be done server side and has a big enough use case, ideas for
>> patches are always welcome :D
>
>
> I didn't think of it that way, but I agree. Perhaps a querytools  
> plugin
> would be in order when the plugin system lands, but this is probably  
> best
> left to the client. Does anybody know if jquery.couch does something  
> like
> this? If not, I'll have a go at hacking it in.

I've written a plugin (in Erlang) that allows joins by tracking  
updates and replicating keys to Mnesia. It has the same behaviour as  
the existing map-reduce views e.g. updates only on view request. You  
can just as easily use e.g. SQLite rather than Mnesia.

I did this to allow arbitrary join queries against my model without  
doing it on the client, and to centralise caching for high-performance  
joins such as transitive User->Permission checks when using a User/ 
Role/Permission model that doesn't want both relationships to be  
stored on the Role object.

I can separately compile and deploy this using the simple change to  
bin/couchdb that I described in an earlier post. It involves no change  
to the rest of CouchDB.

*** Suggestion ***

Now that I've done it, I realize that it's overkill, and I'm  
abandoning that approach, not only because very few people are going  
to want to do this in Erlang, but also because the same effect can be  
gained using _external. The single optimization required is for a  
request coming through _external to carry the db seqno for the request  
(presuming the _external request is qualified by a db). This allows  
you to easily avoid updating your external index, without making any  
request if there have been no changes since the last request. This is  
important because it's a performance hit if you have to make an  
additional HTTP request to CouchDB on every view request.

You should also use both a startkey and an endkey in the  
_all_docs_by_seq request, which in effect gives the same semantics as  
map-reduce views i.e. you don't get into any race conditions during  
updating because you see one particular MVCC snapshot.
You can synthesize this approach by simultaneously using a  
notification listener and arranging for that process to talk to your  
_external handler, but given that the Erlang endpoint has easy access  
to the seqno, why not supply it to the external process, and avoid the  
hassle. Listening for notifications is necessary if you are going to  
use something like memcached, or some query mechanism that doesn't go  
through CouchDB (and you want to avoid requesting the db seqno on  
every request).

Maybe a different external that enforces a db qualifier, such as  
_external_view, would be appropriate. There are further optimisations  
I considered, such as enriching the _external protocol to allow the  
external process to perform the _all_docs_by_seq request (and  
subsequent document GETs) over the port, but on reflection the minimal  
change is preferable.

Finally, by not doing this in Erlang it is easier to conceive of a  
mechanism that allows you to deploy the _external handler via CouchDB  
itself. Doing the plugin in Erlang introduces the Subject/Object  
problem, which will it can be overcome, is a lot more fraught IMO.  
Paul suggested this for Erlang plugins, but I think it's more  
applicable to this scenario.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Did you hear about the Buddhist who refused Novocain during a root  
canal?
His goal: transcend dental medication.



Re: joins, reprise

Posted by Dean Landolt <de...@deanlandolt.com>.
On Tue, Nov 11, 2008 at 3:41 PM, Paul Davis <pa...@gmail.com>wrote:

> AFAIK, the jquery.couch doesn't have anything built in.


Yeah, I just gave it a once over. It doesn't look like it has support for
multi-key get yet either, so I'll see if I can't carve that in first.

Re: joins, reprise

Posted by Paul Davis <pa...@gmail.com>.
AFAIK, the jquery.couch doesn't have anything built in.

On Tue, Nov 11, 2008 at 3:32 PM, Dean Landolt <de...@deanlandolt.com> wrote:
> On Tue, Nov 11, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com>wrote:
>
>> I think this is an interesting idea, and has mostly been done with
>> client libraries. ATM, I'm leaning towards saying that this is a
>> client extension and doesn't really belong in couch. There are a crap
>> load of optimizations that clients could make that couch couldn't.
>>
>> I have some ideas running around in my head about doing object graph
>> loading etc. Things really start to get fun on the client when you
>> contemplate referencing other databases etc.
>>
>> Anyway, if you can come up with some part of this functionality that
>> *must* be done server side and has a big enough use case, ideas for
>> patches are always welcome :D
>
>
> I didn't think of it that way, but I agree. Perhaps a querytools plugin
> would be in order when the plugin system lands, but this is probably best
> left to the client. Does anybody know if jquery.couch does something like
> this? If not, I'll have a go at hacking it in.
>
>
>>
>>
>> Paul
>>
>>
>> On Tue, Nov 11, 2008 at 2:57 PM, Dean Landolt <de...@deanlandolt.com>
>> wrote:
>> > I know this join discussion has come up again and again, and with
>> multi-key
>> > get it's really not that hard to do a join in two steps. But I was
>> playing
>> > with include_docs and it got me thinking -- couldn't there be a
>> dereference
>> > parameter on views that acted like include docs? So you could pass in
>> > ?dereference=tags, for instance, and in the return the view could grab
>> all
>> > tag attributes and dereference them if their value is a string containing
>> an
>> > existing _id. This would be a *real* join, all without any complicated
>> > accounting -- and it would put an end to the whole *but sql can do
>> this*...
>> > whining.
>> >
>> > Bonus points for being able to pass in a list of dereference attributes
>> > (?dereference=[tags,comments] or ?dereference=tags&dereference=comments)
>> and
>> > even better, if the given attribute's value is a list, iterating over it
>> and
>> > dereferencing any string containing an existing _id. This would be far
>> nicer
>> > than sql joins -- you would essentially get real many-to-manies sans
>> > join-table awkwardness.
>> >
>> > Does this seem logical? Doable?
>> >
>>
>

Re: joins, reprise

Posted by Dean Landolt <de...@deanlandolt.com>.
On Tue, Nov 11, 2008 at 3:08 PM, Paul Davis <pa...@gmail.com>wrote:

> I think this is an interesting idea, and has mostly been done with
> client libraries. ATM, I'm leaning towards saying that this is a
> client extension and doesn't really belong in couch. There are a crap
> load of optimizations that clients could make that couch couldn't.
>
> I have some ideas running around in my head about doing object graph
> loading etc. Things really start to get fun on the client when you
> contemplate referencing other databases etc.
>
> Anyway, if you can come up with some part of this functionality that
> *must* be done server side and has a big enough use case, ideas for
> patches are always welcome :D


I didn't think of it that way, but I agree. Perhaps a querytools plugin
would be in order when the plugin system lands, but this is probably best
left to the client. Does anybody know if jquery.couch does something like
this? If not, I'll have a go at hacking it in.


>
>
> Paul
>
>
> On Tue, Nov 11, 2008 at 2:57 PM, Dean Landolt <de...@deanlandolt.com>
> wrote:
> > I know this join discussion has come up again and again, and with
> multi-key
> > get it's really not that hard to do a join in two steps. But I was
> playing
> > with include_docs and it got me thinking -- couldn't there be a
> dereference
> > parameter on views that acted like include docs? So you could pass in
> > ?dereference=tags, for instance, and in the return the view could grab
> all
> > tag attributes and dereference them if their value is a string containing
> an
> > existing _id. This would be a *real* join, all without any complicated
> > accounting -- and it would put an end to the whole *but sql can do
> this*...
> > whining.
> >
> > Bonus points for being able to pass in a list of dereference attributes
> > (?dereference=[tags,comments] or ?dereference=tags&dereference=comments)
> and
> > even better, if the given attribute's value is a list, iterating over it
> and
> > dereferencing any string containing an existing _id. This would be far
> nicer
> > than sql joins -- you would essentially get real many-to-manies sans
> > join-table awkwardness.
> >
> > Does this seem logical? Doable?
> >
>

Re: joins, reprise

Posted by Paul Davis <pa...@gmail.com>.
I think this is an interesting idea, and has mostly been done with
client libraries. ATM, I'm leaning towards saying that this is a
client extension and doesn't really belong in couch. There are a crap
load of optimizations that clients could make that couch couldn't.

I have some ideas running around in my head about doing object graph
loading etc. Things really start to get fun on the client when you
contemplate referencing other databases etc.

Anyway, if you can come up with some part of this functionality that
*must* be done server side and has a big enough use case, ideas for
patches are always welcome :D

Paul


On Tue, Nov 11, 2008 at 2:57 PM, Dean Landolt <de...@deanlandolt.com> wrote:
> I know this join discussion has come up again and again, and with multi-key
> get it's really not that hard to do a join in two steps. But I was playing
> with include_docs and it got me thinking -- couldn't there be a dereference
> parameter on views that acted like include docs? So you could pass in
> ?dereference=tags, for instance, and in the return the view could grab all
> tag attributes and dereference them if their value is a string containing an
> existing _id. This would be a *real* join, all without any complicated
> accounting -- and it would put an end to the whole *but sql can do this*...
> whining.
>
> Bonus points for being able to pass in a list of dereference attributes
> (?dereference=[tags,comments] or ?dereference=tags&dereference=comments) and
> even better, if the given attribute's value is a list, iterating over it and
> dereferencing any string containing an existing _id. This would be far nicer
> than sql joins -- you would essentially get real many-to-manies sans
> join-table awkwardness.
>
> Does this seem logical? Doable?
>