Posted to user@couchdb.apache.org by Sho Fukamachi <sh...@gmail.com> on 2008/08/05 07:15:27 UTC

Re: when to use another document and when not to?

On 28/07/2008, at 3:18 AM, Paul Carey wrote:

> At the risk of misunderstanding / stating the obvious, there's nothing
> to stop you doing both 2 and 3 and eliminating the n+1 query loop.
> Each user can maintain an array of both those he follows and those who
> follow him. You could update both follower and leader in a bulk
> transaction to ensure consistency.

At this stage I don't think there's anything obvious to state! Yes -  
this could be option 4 on that list, storing the relationship on  
*both* sides, which does indeed solve the basic problem of how to get  
one from the other minus the n+1 query.

I had also considered this solution but I think there are a few  
problems with this "multi-master" style of recording this information.

First, as far as I can see, it has replication problems.

For example, say you've got two servers, one in Japan and one in the
USA. A user in the US adds a tag to a photo, so both the Photo and Tag
records are updated. Meanwhile the owner of the photo in Japan
successfully deletes an inappropriate tag from the same photo. What
happens when the two servers then try to reconcile these records a
couple of minutes later? The tag records will be fine, but the photo
record will be in conflict, with two new versions of the doc - one has
a tag ID deleted, the other has one added. The servers have no way of
knowing what to do without going back and trying to rebuild from the
tags, which will require intelligence on the program side - in a
well-designed system this condition should never be allowed to arise.
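
To make the conflict concrete, here is a toy sketch in Python of the scenario above. Plain dicts stand in for the copies held by each server; the IDs, the `_rev` strings, and the `merge_tag_ids` helper are all made up for illustration and are not part of CouchDB. It shows the kind of three-way merge the application would have to supply, which needs the common ancestor as well as the two conflicting versions.

```python
# Toy model only: plain dicts stand in for the copies held by each server.
base = {"_id": "photo-1", "_rev": "1-a", "tag_ids": ["cat", "spam"]}

# US server: a user adds a tag to the photo.
us = dict(base, _rev="2-us", tag_ids=base["tag_ids"] + ["tokyo"])

# Japan server: the owner deletes the inappropriate tag.
jp = dict(base, _rev="2-jp", tag_ids=[t for t in base["tag_ids"] if t != "spam"])


def merge_tag_ids(parent, a, b):
    """Hypothetical application-level three-way merge of two tag arrays."""
    added = (set(a) | set(b)) - set(parent)      # new on either side
    removed = set(parent) - (set(a) & set(b))    # dropped on either side
    return sorted((set(parent) - removed) | added)


# Both edits descend from revision 1-a, so neither simply supersedes the other.
conflicting = us["_rev"] != jp["_rev"]
merged = merge_tag_ids(base["tag_ids"], us["tag_ids"], jp["tag_ids"])
print(conflicting, merged)
```

Without the ancestor (`base`) the two arrays alone do not say whether "spam" was removed on one side or added on the other, which is why the servers cannot settle this by themselves.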

Second, it doesn't solve the "contention" problem for popular records.  
In fact it magnifies it since now the updates are happening on both  
sides. This makes the above replication issue worse. In a photos  
situation, imagine a few more things you're trying to store in the  
Photo document - users who have made it a favourite, for example. Now  
imagine the photo gets linked on Digg and a few thousand people try to  
make it a favourite and tag it in the space of a few minutes. Your  
users are likely "operating" on a copy of the actual record - they  
have in effect "checked it out" and are making their changes (adding  
tags, favourites, etc). Obviously this is going to be very bad, as when
they check it back in, any changes made in the meantime are lost - unless
you start adding logic to deal with all of that. But you can see it
would get complex and doesn't really play to Couch's strengths.
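
A rough sketch of what that extra logic might look like, in Python. On a single node, a write carrying a stale `_rev` is rejected rather than silently applied (CouchDB answers with a 409 conflict), so the logic amounts to a re-read-and-retry loop. `ToyStore`, `Conflict`, `add_favourite`, and the integer revisions below are assumptions for illustration, not a real client API.

```python
import copy


class Conflict(Exception):
    """Raised when a write carries a stale revision (stands in for HTTP 409)."""


class ToyStore:
    """In-memory stand-in for a single database node; not a CouchDB client."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        return copy.deepcopy(self.docs[doc_id])

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise Conflict(doc["_id"])
        new_rev = (current["_rev"] if current else 0) + 1
        self.docs[doc["_id"]] = copy.deepcopy(dict(doc, _rev=new_rev))
        return new_rev


def add_favourite(store, photo_id, user_id, max_tries=5):
    """Re-read and retry until the write lands on the latest revision."""
    for attempt in range(1, max_tries + 1):
        doc = store.get(photo_id)
        if user_id not in doc["favourited_by"]:
            doc["favourited_by"].append(user_id)
        try:
            store.put(doc)
            return attempt
        except Conflict:
            continue
    raise Conflict(photo_id)


store = ToyStore()
store.put({"_id": "photo-1", "favourited_by": []})

# Two users "check out" the same revision of the photo.
alice_copy = store.get("photo-1")
bob_copy = store.get("photo-1")

alice_copy["favourited_by"].append("alice")
store.put(alice_copy)

bob_copy["favourited_by"].append("bob")
try:
    store.put(bob_copy)  # stale revision, so this is rejected
    rejected = False
except Conflict:
    rejected = True

add_favourite(store, "photo-1", "bob")  # the retry loop picks up alice's change
```

With a few thousand writers hitting one document, most attempts would end up in the retry branch, which is the contention problem in a nutshell.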

Third, this is kind of related to the above two - if you use this  
technique for caching various forms of data, you have no inbuilt way  
of checking if that data has changed. Say you want to cache the  
username of all users who made this photo a favourite. The user then  
changes their username. Your cache is out of date. You'd have to write  
something to go and look at every single one of these caches and  
update it with the new username. Again this seems to be a bad way of  
doing it and requires too much "intelligence".

Fourth is kind of a philosophical problem which many may not agree  
with, and it may be the RDBMS devil on my shoulder speaking, but to me  
the membership (tag relationship, follower relationship, whatever) is  
a discrete piece of data and should have its own document. Having this  
discrete information existing solely in the metadata of other records  
kind of bothers me. This kind of ties in with the other points as well  
- I want the relationships to be rebuildable. If the arrays on both  
sides become inconsistent, there should be some way of regenerating  
them.

So for these reasons I think that just storing the array on both sides
is a bad idea. From thinking about this I keep coming back to the
"membership" doc as being a necessity, with a few improvements on the
previous implementation.

The approach I have settled on (for now) is that you do create the
Membership document, and then you cache all the information you
need in it - including the revs of the two other objects it refers to.
So you might have TagMembership, and it includes the Photo name, photo
ID, photo rev, and all that for the Tag as well.

It gains you:

- a "canonical" document - the data is not "multi-master" and is  
easily replicable
- because you have the _revs of both remote docs you can detect  
obsoletions in an automated "dumb" way - if a user changes their name,  
for example, you could grab every membership doc that doesn't have the  
new user_rev and bulk update them.
- one relationship, one doc, no contention problems
- you can then grab all the tag names for a specific Photo ID, or all  
the photo names for a specific Tag ID, in a single view
- using the membership doc as a source, you can then go and cache data  
on either side at will if you really want to - the important point  
being that it's rebuildable from the "canonical" doc
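
The points above can be sketched roughly as follows, in Python with plain dicts standing in for documents. The field names (`photo_rev`, `tag_name`, and so on), the IDs, and the helper functions are assumptions chosen for illustration; only the overall shape (one doc per relationship, carrying cached names plus the revs they were copied from) comes from the description above.

```python
from collections import defaultdict

# Current state of the "remote" documents (illustrative data).
photos = {"photo-1": {"_rev": "2-b", "name": "Shibuya crossing"}}
tags = {
    "tag-1": {"_rev": "1-a", "name": "tokyo"},
    "tag-2": {"_rev": "3-c", "name": "night"},  # renamed since it was cached
}

# One document per relationship, caching names and the revs they came from.
memberships = [
    {"_id": "m-1", "type": "TagMembership",
     "photo_id": "photo-1", "photo_rev": "2-b", "photo_name": "Shibuya crossing",
     "tag_id": "tag-1", "tag_rev": "1-a", "tag_name": "tokyo"},
    {"_id": "m-2", "type": "TagMembership",
     "photo_id": "photo-1", "photo_rev": "2-b", "photo_name": "Shibuya crossing",
     "tag_id": "tag-2", "tag_rev": "2-b", "tag_name": "nite"},
]


def stale_memberships(memberships, photos, tags):
    """The "dumb" check: a cached rev that no longer matches means stale data."""
    return [m for m in memberships
            if m["photo_rev"] != photos[m["photo_id"]]["_rev"]
            or m["tag_rev"] != tags[m["tag_id"]]["_rev"]]


def refresh(m, photos, tags):
    """Re-copy the cached fields from the current photo and tag documents."""
    photo, tag = photos[m["photo_id"]], tags[m["tag_id"]]
    return dict(m, photo_rev=photo["_rev"], photo_name=photo["name"],
                tag_rev=tag["_rev"], tag_name=tag["name"])


def index_by(memberships, key_field, value_field):
    """Stand-in for a view: group one cached field by another."""
    index = defaultdict(list)
    for m in memberships:
        index[m[key_field]].append(m[value_field])
    return dict(index)


stale_ids = {m["_id"] for m in stale_memberships(memberships, photos, tags)}
fresh = [refresh(m, photos, tags) if m["_id"] in stale_ids else m
         for m in memberships]

tag_names_by_photo = index_by(fresh, "photo_id", "tag_name")
photo_names_by_tag = index_by(fresh, "tag_id", "photo_name")
```

Here only `m-2` is picked up as stale, and after the refresh both lookups (tag names for a photo, photo names for a tag) come from the same set of membership docs.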

This seems to me to be the best way of doing this for now, but I'd like
to hear any arguments or other ideas. If I am not making sense then I can
provide code examples if anyone would like that...

Sho




> I've created a simple example. I used photos and tags instead of
> followers because I find the self referentiality of the follower model
> adds unnecessary confusion, but the underlying concept - a many to
> many relationship - remains.
> http://friendpaste.com/ev5DAJTR
>
> Cheers
>
> Paul


Re: when to use another document and when not to?

Posted by Chris Anderson <jc...@grabb.it>.
On Wed, Aug 6, 2008 at 1:01 PM, Chris Anderson <jc...@grabb.it> wrote:
> The event system could be prototyped by eavesdropping on the
> communication between CouchDB and the view server, and firing events
> based on the emitted keys.
>

replying to myself...

the problem with the event approach is that you get some data lag -
those emit() events aren't fired until the view is requested. That
means if you have a remap based on a rarely accessed map view, the
remap could be stale but not marked as such, until the view is
accessed.

Your solution (associating doc-ids and revs with rows in the remap
index) is nice in that you can just watch the change history of the
database to see if any of the associated documents were touched.



-- 
Chris Anderson
http://jchris.mfdz.com

Re: when to use another document and when not to?

Posted by Chris Anderson <jc...@grabb.it>.
On Wed, Aug 6, 2008 at 12:50 PM, Sho Fukamachi <sh...@gmail.com> wrote:
>
> Chris, what did you think about my idea of attaching/making available _rev
> information for views?
>
> Basically if we could get a _rev for every key-value output, could get an
> array of them (however massive) for the "root" of a particular view, and a
> "top" _rev for the view, it would make tracking changes much easier. I have
> been messing around and think it would be pretty easy to make a decent
> "sync" for remapping purposes if that info was available.

I think that it's a feasible way to start down the path... I'd be
curious to see what Damien thinks - he may have an eye on how to do
this in a way that could eventually be used to support remapping
natively.

My first idea of how to do this was with an event system. If there was
a way to "subscribe" to changes in a view (and which key they affect)
you could use it to mark parts of the remap index as dirty. Then to
regenerate you'd only have to recalculate the dirty parts (and also
any keys in the view which aren't in the remap - making that search
efficient might be problematic).

The event system could be prototyped by eavesdropping on the
communication between CouchDB and the view server, and firing events
based on the emitted keys.
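
A minimal Python sketch of the dirty-marking idea, under the assumption that some event source calls back with the affected view key. No such subscription API exists in CouchDB as far as this thread is concerned; `RemapIndex`, `on_view_change`, and the data are hypothetical.

```python
class RemapIndex:
    """Toy remap index that only recomputes keys marked dirty."""

    def __init__(self, source_view, remap_fn):
        self.source_view = source_view    # dict: view key -> list of values
        self.remap_fn = remap_fn          # second-stage function over those values
        self.rows = {}
        self.dirty = set(source_view)     # everything is dirty before the first build
        self.recomputed = 0

    def on_view_change(self, key):
        """Hypothetical event handler: a row under this key changed."""
        self.dirty.add(key)

    def regenerate(self):
        for key in sorted(self.dirty):
            if key in self.source_view:
                self.rows[key] = self.remap_fn(self.source_view[key])
                self.recomputed += 1
            else:
                self.rows.pop(key, None)  # key vanished from the source view
        self.dirty.clear()


view = {"tokyo": ["p-1", "p-2"], "night": ["p-1"]}
index = RemapIndex(view, remap_fn=len)
index.regenerate()                        # initial build touches both keys

view["tokyo"].append("p-3")
index.on_view_change("tokyo")
del view["night"]
index.on_view_change("night")
index.regenerate()                        # only "tokyo" is recomputed
```

Keys that appear in the view without ever firing an event would be missed by this scheme, which is the search problem mentioned above.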

-- 
Chris Anderson
http://jchris.mfdz.com

Re: when to use another document and when not to?

Posted by Sho Fukamachi <sh...@gmail.com>.
On 05/08/2008, at 3:49 PM, Chris Anderson wrote:

> I think the missing link here is the ability to "remap" map and
> map/reduce results. In Hadoop-style map/reduce, the output of a single
> map will often be remapped in different ways for different purposes.
> Being able to share the intermediate results among further
> reprocessing is helpful, and often people will chain long stretches of
> map reduce processing.

Well, that would be absolutely fantastic if it came to pass. I didn't  
really think it was on the roadmap anytime soon though.

> The challenge for the CouchDB programming model for supporting chained
> map/reduces is the cache-expiry issue. How can we tell which index
> entries to sweep when a document is changed or deleted, when that
> index is itself generated by running map/reduce over another index? I
> tell myself that the bookkeeping is possible, but it sure sounds like
> a big job.

Hm. Not to risk exposing myself (accurately) as someone who has no  
grasp whatsoever of the complexities of such things - could a similar  
approach to the current _rev system be used?

Perhaps you could have two levels of revisions - one for the total  
view, which changed whenever anything in the view changed. That would  
signal the re-reducing view that it needed to go look at the view again.

And then a second level of revisions could be on individual key "row"
output. The re-reduce could then just look at the ones that changed -
it would simply drop any revs it had that didn't appear in the new
listing, and import any new ones - that would handle
additions/removals as well.
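
A small sketch of the two-level idea in Python, purely hypothetical: views do not expose revisions, so here a "row rev" is simulated as a hash of the row's key and value, and the "top rev" as a hash over all row revs. The function names are made up for illustration.

```python
import hashlib
import json


def row_rev(key, value):
    """Simulated per-row revision: a short hash of the row contents."""
    payload = json.dumps([key, value], sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest()[:8]


def snapshot(rows):
    """Return (top_rev, {key: row_rev}) for a view given as a key -> value dict."""
    row_revs = {key: row_rev(key, value) for key, value in rows.items()}
    top = hashlib.sha1(json.dumps(sorted(row_revs.items())).encode()).hexdigest()[:8]
    return top, row_revs


def diff(old_revs, new_revs):
    """Keys to drop, and keys to (re)import, going from old to new."""
    dropped = sorted(k for k in old_revs if k not in new_revs)
    changed = sorted(k for k in new_revs if old_revs.get(k) != new_revs[k])
    return dropped, changed


old_top, old_revs = snapshot({"tokyo": 2, "night": 1})
new_top, new_revs = snapshot({"tokyo": 3, "osaka": 1})

if new_top != old_top:                    # level one: did anything change at all?
    dropped, changed = diff(old_revs, new_revs)   # level two: which rows?
```

The consumer checks the single top value first and only walks the per-row listing when it differs, so an unchanged view costs one comparison.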

I'm probably oversimplifying things? Basically just trying to think of  
"the simplest thing that could possibly work" ...


> I have a prototype of remapping (with no cache-awareness) in
> CouchRest's git repo
> http://github.com/jchris/couchrest/tree/master/utils/remap.rb

Thanks - I'd been reading that anyway after noticing the bump to  
0.9.0. Looks great, will try it out!

> You're making sense, but I also wouldn't mind code examples :)

Sure, if you can stomach my awful code ...

http://friendpaste.com/DYsves9s

In that (unedited, confused, messy) example I create and utilise two  
methods of getting at the membership data. The first is the one I  
discussed, ie caching it all in the Membership class. The second is to  
then place those caches in the "remote" record itself. Well, basically  
I just try lots of things, and it's all there if you can stand huge  
messes of experimentation : )

If you want to run it, you'll need edge DataMapper - as in, from the  
last 24 hours. This is probably the easiest way to get it:

http://datamapper.org/articles/stunningly_easy_way_to_live_on_the_edge.html

Hope that's useful for someone.

Sho


> -- 
> Chris Anderson
> http://jchris.mfdz.com


Re: when to use another document and when not to?

Posted by Chris Anderson <jc...@grabb.it>.
On Mon, Aug 4, 2008 at 10:15 PM, Sho Fukamachi <sh...@gmail.com> wrote:
> So for these reasons I think that just storing the array on both sides is a
> bad idea. From thinking about this I keep coming back to the "membership"
> doc as being a necessity. With a few improvements on the previous
> implementation.

I think the missing link here is the ability to "remap" map and
map/reduce results. In Hadoop-style map/reduce, the output of a single
map will often be remapped in different ways for different purposes.
Being able to share the intermediate results among further
reprocessing is helpful, and often people will chain long stretches of
map reduce processing.

The challenge for the CouchDB programming model for supporting chained
map/reduces is the cache-expiry issue. How can we tell which index
entries to sweep when a document is changed or deleted, when that
index is itself generated by running map/reduce over another index? I
tell myself that the bookkeeping is possible, but it sure sounds like
a big job.

> to me the membership (tag relationship, follower relationship, whatever)
> is a discrete piece of data and should have its own document.

Using remapping, you could have the membership document
({user:user_id, tag:tag, photo:photo_id}), and still get to the goal,
which is a view that has photos sorted by tag, so that with ?key="tag"
you could load all the photos with a given tag. (A user or photo's
tagcloud can come from a view directly on the tagging document.)
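
As a rough illustration of the remapping idea (not the remap.rb implementation, which is not reproduced here), the two stages could look like this in Python. The document shapes follow the `{user, tag, photo}` example above; the `type` field, the IDs, and the function names are assumptions.

```python
docs = [
    {"_id": "m-1", "type": "membership", "user": "u-1", "tag": "tokyo", "photo": "p-1"},
    {"_id": "m-2", "type": "membership", "user": "u-2", "tag": "tokyo", "photo": "p-2"},
    {"_id": "m-3", "type": "membership", "user": "u-1", "tag": "night", "photo": "p-1"},
    {"_id": "p-1", "type": "photo", "title": "Shibuya crossing"},
    {"_id": "p-2", "type": "photo", "title": "Tokyo Tower"},
]


def map_memberships(doc):
    """First stage: emit (tag, photo_id) for every membership document."""
    if doc.get("type") == "membership":
        yield doc["tag"], doc["photo"]


def remap(rows, docs_by_id):
    """Second stage: replace each photo ID with the photo document itself."""
    index = {}
    for tag, photo_id in sorted(rows):
        index.setdefault(tag, []).append(docs_by_id[photo_id])
    return index


docs_by_id = {d["_id"]: d for d in docs}
rows = [row for d in docs for row in map_memberships(d)]
photos_by_tag = remap(rows, docs_by_id)

# Equivalent of querying the remapped index with ?key="tokyo"
titles = [p["title"] for p in photos_by_tag["tokyo"]]
```

The second stage reads the output of the first rather than the raw documents, which is why a change to any photo or membership leaves the remapped index stale until it is rebuilt.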

I have a prototype of remapping (with no cache-awareness) in
CouchRest's git repo
http://github.com/jchris/couchrest/tree/master/utils/remap.rb

We use it at Grabb.it to build join indexes for doing quick lookups.
The downside is that the index (stored in a separate logical database)
has to be regenerated on the addition of new records, because it
doesn't track which documents contributed to a given key.

You're making sense, but I also wouldn't mind code examples :)

-- 
Chris Anderson
http://jchris.mfdz.com

Re: when to use another document and when not to?

Posted by Sho Fukamachi <sh...@gmail.com>.
On 07/08/2008, at 4:01 AM, Chris Anderson wrote:

> I'd love to see remapping as a feature of CouchDB, but realistically
> it seems more like a 2.0 feature. Since we're not to 1.0 yet... don't
> build your apps around the idea of having it. (Unless you can stomach
> regenerating the whole remap when you get new data, like I do with
> remap.rb)

Chris, what did you think about my idea of attaching/making available  
_rev information for views?

Basically if we could get a _rev for every key-value output, could get  
an array of them (however massive) for the "root" of a particular  
view, and a "top" _rev for the view, it would make tracking changes  
much easier. I have been messing around and think it would be pretty  
easy to make a decent "sync" for remapping purposes if that info was  
available.

Obviously that's not as nice as having it built in, but it would be a  
lot better than the blind system we have now. I'm thinking of  
requesting it as a feature, would like to know if you think it's a  
good idea or not, since you've obviously been playing around with it  
as well.

thanks,

Sho

Re: when to use another document and when not to?

Posted by Chris Anderson <jc...@grabb.it>.
On Wed, Aug 6, 2008 at 8:51 AM, Sho Fukamachi <sh...@gmail.com> wrote:
> I think the really exciting thing is the possibility of the re-mapping as
> alluded to by Chris Anderson yesterday. Not a feature of CDB (yet?), but a
> single remap would give us the ability to do all this kind of thing in one
> query. If the problems with expiries and consistency could be solved and
> remapping added to Couch then that would open the door to a whole new range
> of uses, this one included.

I'd love to see remapping as a feature of CouchDB, but realistically
it seems more like a 2.0 feature. Since we're not to 1.0 yet... don't
build your apps around the idea of having it. (Unless you can stomach
regenerating the whole remap when you get new data, like I do with
remap.rb)


-- 
Chris Anderson
http://jchris.mfdz.com

Re: when to use another document and when not to?

Posted by Paul Carey <pa...@gmail.com>.
>> It looks like multi key GETs / bulk_load (I got horribly confused
>> following yesterday's IRC discussion) is on the horizon which will
>> presumably make one of the original suggestions in this thread - a
>> simple join document - feasible.
>
> I am sure that people will start throwing rotten fruit at me for my endless
> harping on about this - but that's not a JOIN.

heh, not from me, I think pedantry is underrated :) I do realise it's
not a JOIN, but nonetheless, 'join document' doesn't seem like an
inappropriate name for such a document - as long as we're all clear
that it doesn't imply a join as known in SQL land.

Re: when to use another document and when not to?

Posted by Sho Fukamachi <sh...@gmail.com>.
On 06/08/2008, at 11:47 PM, Paul Carey wrote:

> +1 for all the reasons you list, except perhaps the fourth, for which
> I think I'm on the fence.

Ha, I'm on the fence about pretty much everything when it comes to  
data structures in CouchDB : )

> It looks like multi key GETs / bulk_load (I got horribly confused
> following yesterday's IRC discussion) is on the horizon which will
> presumably make one of the original suggestions in this thread - a
> simple join document - feasible.

I am sure that people will start throwing rotten fruit at me for my
endless harping on about this - but that's not a JOIN. It's a
two-stage query. It's a good way of doing things, and in many ways
preferable to a JOIN even in an SQL database where we have JOINs - but
it's not a JOIN! However, it does get rid of the n+1 GETs from the
previous example, so it's a big improvement and probably fine for most
cases.
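
The difference in round trips can be sketched in Python with an in-memory stand-in that counts requests. `CountingDB` and its methods are hypothetical; in particular `get_many` only stands in for the multi-key fetch that was still being discussed at the time.

```python
class CountingDB:
    """In-memory stand-in that counts round trips; not a CouchDB client."""

    def __init__(self, docs, memberships):
        self.docs = docs
        self.memberships = memberships
        self.requests = 0

    def tag_ids_for_photo(self, photo_id):
        self.requests += 1                # one view query
        return [m["tag_id"] for m in self.memberships
                if m["photo_id"] == photo_id]

    def get(self, doc_id):
        self.requests += 1                # one GET per document
        return self.docs[doc_id]

    def get_many(self, doc_ids):
        self.requests += 1                # one multi-key request
        return [self.docs[i] for i in doc_ids]


def tags_n_plus_one(db, photo_id):
    return [db.get(tag_id)["name"] for tag_id in db.tag_ids_for_photo(photo_id)]


def tags_two_stage(db, photo_id):
    return [d["name"] for d in db.get_many(db.tag_ids_for_photo(photo_id))]


tag_docs = {"t-1": {"name": "tokyo"}, "t-2": {"name": "night"}, "t-3": {"name": "neon"}}
links = [{"photo_id": "p-1", "tag_id": t} for t in tag_docs]

slow = CountingDB(tag_docs, links)
fast = CountingDB(tag_docs, links)
slow_names = tags_n_plus_one(slow, "p-1")   # 1 view query + 3 GETs
fast_names = tags_two_stage(fast, "p-1")    # 1 view query + 1 multi-key fetch
```

Both return the same names; the first costs n+1 requests and the second always costs two, whatever n is.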

I think the really exciting thing is the possibility of the re-mapping  
as alluded to by Chris Anderson yesterday. Not a feature of CDB  
(yet?), but a single remap would give us the ability to do all this  
kind of thing in one query. If the problems with expiries and  
consistency could be solved and remapping added to Couch then that  
would open the door to a whole new range of uses, this one included.

thanks,

Sho

Re: when to use another document and when not to?

Posted by Paul Carey <pa...@gmail.com>.
> So for these reasons I think that just storing the array on both sides is a
> bad idea.

+1 for all the reasons you list, except perhaps the fourth, for which
I think I'm on the fence.

Storing the relationships on both sides would indeed be feasible only
in a heavily constrained environment - limited contention and perhaps
bounded document size. A quick look at Flickr shows 4,249,311 photos
tagged with 'london', with a net change of +/- 2 every second or so.
At four bytes per tag id, good luck writing that much data that
frequently.
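
For a sense of scale, a back-of-envelope calculation in Python. It assumes that every change rewrites the whole array, that the document holds nothing but the 4-byte IDs (a real JSON document would be larger), and that there is one such rewrite per second; those assumptions are mine, the two input figures are the ones quoted above.

```python
PHOTOS_TAGGED = 4_249_311    # figure quoted above
BYTES_PER_ID = 4             # figure quoted above
REWRITES_PER_SECOND = 1      # assumption: one full rewrite per second
SECONDS_PER_DAY = 86_400

doc_bytes = PHOTOS_TAGGED * BYTES_PER_ID
bytes_per_day = doc_bytes * REWRITES_PER_SECOND * SECONDS_PER_DAY

print(f"{doc_bytes / 1e6:.1f} MB per revision")           # 17.0 MB
print(f"{bytes_per_day / 1e12:.2f} TB written per day")   # 1.47 TB
```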

> The approach I have settled on (for now) is that you do create the
> Membership document, and then you and cache all the information you need in it

> - you can then grab all the tag names for a specific Photo ID, or all the
> photo names for a specific Tag ID, in a single view

It looks like multi key GETs / bulk_load (I got horribly confused
following yesterday's IRC discussion) is on the horizon which will
presumably make one of the original suggestions in this thread - a
simple join document - feasible.

Paul