Posted to user@couchdb.apache.org by David Nolen <dn...@gmail.com> on 2009/11/11 05:19:53 UTC

Question about document copies & replication

Ok,

My mind is being blown by CouchDB :D

So I've realized that having a few databases per user is a really great idea
if you decide to scale by decentralizing your content (clients do the heavy
lifting by running queries on their local couchdb instances - since you're
replicating to the client you don't really care that a lot of data is
getting copied around). Every user has their own view of the world, and the
server CouchDB instance is really only for dealing with content shared
between users.

In our application (http://shiftspace.org), I'm thinking something along
these lines:

Client's laptop CouchDB instance:
user/private - all the documents a user has created plus replicated content
from groups and whatnot
user/public - all the user's public content
user/inbox - short messages

Server CouchDB instance:
group/x - group dbs of shared content. Replicated downstream to individual
user/private who belong to the group
master - user/public dbs replicated upstream to here.

So my question is this. When a user publishes a document, it is written to
user/private. If the user publishes a document to the world, we make a copy
of it in user/public - it's just the same data minus the _rev field.
Whenever a user updates a public document, we update the user/private copy
as well as the user/public copy which will be replicated upstream to the
server.
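
A minimal sketch of the copy step described above, assuming documents are plain JSON objects. The helper name toPublicCopy and the sample fields are hypothetical, not part of CouchDB or ShiftSpace. It also shows one thing to watch for: each database assigns its own revisions, so updating a copy that already exists in user/public needs the _rev that user/public currently holds, not the one from user/private.

```javascript
// Hypothetical helper: build the user/public copy of a private document.
// The copy keeps the same _id and body but drops the private _rev.
function toPublicCopy(privateDoc, publicRev) {
  const { _rev, ...body } = privateDoc;
  // On a later update, attach the revision held by user/public (if any),
  // because the two databases track revisions independently.
  return publicRev ? { ...body, _rev: publicRev } : body;
}

// Sample document; the field names are made up for illustration.
const privateDoc = { _id: "shift-1", _rev: "3-abc", title: "Note" };
console.log(toPublicCopy(privateDoc));          // first publish, no _rev
console.log(toPublicCopy(privateDoc, "1-def")); // update of an existing public copy
```

Because the two copies carry unrelated revision histories, they behave as two separate documents that happen to share an _id.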

So my question for the CouchDB gurus, will creating copies of documents in
this manner create potential problems?

Thanks much,
David

Re: Question about document copies & replication

Posted by David Nolen <dn...@gmail.com>.
On Tue, Nov 10, 2009 at 11:42 PM, Christopher O'Connell <
jwriteclub@gmail.com> wrote:

> It might make more sense to store a field indicating whether a document is
> public or private, and then use some software to only replicate public
> docs.
>

?

"Some software" as in _not_ CouchDB?


> Keeping multiple local copies just seems like a bad plan, and if you want
> to
> support two way replications, it will almost certainly run into problems.
>

In my scenario, published documents always flow "upstream". Users have the
canonical copy, so to speak. There will never be a reason to replicate from the
remote server to a local user/public. The existence of a local user/public
is specifically to have a filtered list of the user's data that needs to be
replicated to the remote server without having to reinvent the wheel (i.e.
write our own software to handle this). Also, in our application the user might
not only use our server for sharing content. He/she may decide to use
another server - by having user/public, it becomes trivial for them to
replicate their public documents to another service.


> Indeed, you may want to replicate the whole user database as it is, and
> simply expose the public documents via a view on the server: Something like
>

The point is actually to not replicate everything. user/private never gets
replicated to the server. That is, it is truly _private_.


> function(doc) {
>    if(doc.Visibility == "Public")
>        emit(doc._id, null);
> }
>
> Those who really know what they're talking about are welcome to slap me
> around on this.
>
> ~ Christopher
>
> On Tue, Nov 10, 2009 at 9:19 PM, David Nolen <dn...@gmail.com>
> wrote:
>
> > Ok,
> >
> > My mind is being blown by CouchDB :D
> >
> > So I've realized that having a few databases per user is a really great
> > idea
> > if you decide to scale by decentralizing your content (clients do the
> heavy
> > lifting by running queries on their local couchdb instances - since
> you're
> > replicating to the client you don't really care that a lot of data is
> > getting copied around). Every user has their own view of the world, and
> the
> > server CouchDB instance is really only for dealing with content shared
> > between users.
> >
> > In our application (http://shiftspace.org), I'm thinking something along
> > these lines:
> >
> > Client's laptop CouchDB instance:
> > user/private - all the documents a user has created plus replicated
> content
> > from groups and whatnot
> > user/public - all the user's public content
> > user/inbox - short messages
> >
> > Server CouchDB instance:
> > group/x - group dbs of shared content. Replicated downstream to
> individual
> > user/private who belong to the group
> > master - user/public dbs replicated upstream to here.
> >
> > So my question is this. When a user publishes a document, it is written
> to
> > user/private. If the user publishes a document to the world, we make a
> copy
> > of it in user/public - it's just the same data minus the _rev field.
> > Whenever a user updates a public document, we update the user/private
> copy
> > as well as the user/public copy which will be replicated upstream to
> > the
> > server.
> >
> > So my question for the CouchDB gurus, will creating copies of documents
> in
> > this manner create potential problems?
> >
> > Thanks much,
> > David
> >
>

Re: Question about document copies & replication

Posted by Christopher O'Connell <jw...@gmail.com>.
It might make more sense to store a field indicating whether a document is
public or private, and then use some software to only replicate public docs.
Keeping multiple local copies just seems like a bad plan, and if you want to
support two way replications, it will almost certainly run into problems.
Indeed, you may want to replicate the whole user database as it is, and
simply expose the public documents via a view on the server: Something like

function(doc) {
    if(doc.Visibility == "Public")
        emit(doc._id, null);
}
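
To see what a map function like the one above would emit, here is a small sketch that can be run outside CouchDB. Inside CouchDB, emit() is a global supplied by the view server; in this sketch it is passed in as a second argument so the code runs under Node. The names publicOnly and runMap and the sample documents are hypothetical.

```javascript
// Same test as the map function above, with emit passed in explicitly.
function publicOnly(doc, emit) {
  if (doc.Visibility == "Public") {
    emit(doc._id, null);
  }
}

// Stand-in for the view server: run the map over each doc and collect rows.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows;
}

// Sample documents, invented for illustration.
const docs = [
  { _id: "a", Visibility: "Public" },
  { _id: "b", Visibility: "Private" },
  { _id: "c" },
];
console.log(runMap(publicOnly, docs));
```

Only documents whose Visibility field equals "Public" produce a row; documents without the field are skipped.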

Those who really know what they're talking about are welcome to slap me
around on this.

~ Christopher

On Tue, Nov 10, 2009 at 9:19 PM, David Nolen <dn...@gmail.com> wrote:

> Ok,
>
> My mind is being blown by CouchDB :D
>
> So I've realized that having a few databases per user is a really great
> idea
> if you decide to scale by decentralizing your content (clients do the heavy
> lifting by running queries on their local couchdb instances - since you're
> replicating to the client you don't really care that a lot of data is
> getting copied around). Every user has their own view of the world, and the
> server CouchDB instance is really only for dealing with content shared
> between users.
>
> In our application (http://shiftspace.org), I'm thinking something along
> these lines:
>
> Client's laptop CouchDB instance:
> user/private - all the documents a user has created plus replicated content
> from groups and whatnot
> user/public - all the user's public content
> user/inbox - short messages
>
> Server CouchDB instance:
> group/x - group dbs of shared content. Replicated downstream to individual
> user/private who belong to the group
> master - user/public dbs replicated upstream to here.
>
> So my question is this. When a user publishes a document, it is written to
> user/private. If the user publishes a document to the world, we make a copy
> of it in user/public - it's just the same data minus the _rev field.
> Whenever a user updates a public document, we update the user/private copy
> as well as the user/public copy which will be replicated upstream to
> the
> server.
>
> So my question for the CouchDB gurus, will creating copies of documents in
> this manner create potential problems?
>
> Thanks much,
> David
>

Re: Question about document copies & replication

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Nov 11, 2009 at 12:40 PM, David Nolen <dn...@gmail.com> wrote:

>
> Wait you can have views across multiple dbs? Are there instructions
> somewhere how to do this that I missed?
>

There's no support for it. It's just that if you have access to
CouchDB's JSON collation algorithm, it's possible to merge views in
constant space. CouchDB-Lounge does this for partitioning. The same
algorithm can merge the same (or different) views across multiple
databases. So you can use the CouchDB-Lounge merge code for other
operations.
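
A rough sketch of the merge idea, not the CouchDB-Lounge code itself: given view results that are each already sorted by key, a streaming merge only needs to hold one pending row per source. The comparator here is a deliberate simplification that handles string keys only; CouchDB's real collation orders all JSON types and uses ICU rules for strings. All names and sample rows below are hypothetical.

```javascript
// Simplified comparator: plain string comparison of row keys (an assumption,
// not CouchDB's actual collation).
function compareKeys(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Lazily merge several already-sorted row iterables into one sorted stream.
// Only the current head row of each source is kept in memory.
function* mergeSortedRows(sources, compare = compareKeys) {
  const iterators = sources.map((s) => s[Symbol.iterator]());
  const heads = iterators.map((it) => it.next());
  while (true) {
    let best = -1;
    for (let i = 0; i < heads.length; i++) {
      if (heads[i].done) continue;
      if (best === -1 || compare(heads[i].value.key, heads[best].value.key) < 0) {
        best = i;
      }
    }
    if (best === -1) return; // every source is exhausted
    yield heads[best].value;
    heads[best] = iterators[best].next();
  }
}

// Rows as they might come back from the same view in two databases.
const privateRows = [{ key: "2009-11-10", id: "a" }, { key: "2009-11-12", id: "c" }];
const publicRows = [{ key: "2009-11-11", id: "b" }];
for (const row of mergeSortedRows([privateRows, publicRows])) {
  console.log(row.key, row.id);
}
```

The merged stream is only correct if every source is sorted with the same comparator that the merge uses, which is why access to the collation algorithm matters.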

I'm not sure it's super useful yet, but putting it into core could
open up a lot of possibilities.

Chris

Re: Question about document copies & replication

Posted by David Nolen <dn...@gmail.com>.
On Wed, Nov 11, 2009 at 2:10 PM, Chris Anderson <jc...@apache.org> wrote:

> On Wed, Nov 11, 2009 at 10:49 AM, David Nolen <dn...@gmail.com>
> wrote:
> > On Wed, Nov 11, 2009 at 1:42 PM, David Nolen <dn...@gmail.com>
> wrote:
> >
> >> user/inbox - replicated from server to client
> >> user/private - new and unpublished documents go here
> >>
> >
> > user/private also replicated to user/feed so that we can run queries on
> > merged data.
>
> That's one way to do it. You can also maintain views across multiple
> dbs, and merge them at query time. It's a little tougher to do on a
> thin client.
>

Wait you can have views across multiple dbs? Are there instructions
somewhere how to do this that I missed?


>
> Once couchdb-lounge style functionality arrives in CouchDB you'll be
> able to merge view queries from multiple databases on the server,
> which could help here.
>
> >
> > user/public - publish documents go here (and erased from user/private),
> >> replicated to user/feed and the firehose on the server
> >> user/feed - user/public and user/private replicated here (merged),
> general
> >> queries happen here
> >>
> >
>
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
>

Re: Question about document copies & replication

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Nov 11, 2009 at 10:49 AM, David Nolen <dn...@gmail.com> wrote:
> On Wed, Nov 11, 2009 at 1:42 PM, David Nolen <dn...@gmail.com> wrote:
>
>> user/inbox - replicated from server to client
>> user/private - new and unpublished documents go here
>>
>
> user/private also replicated to user/feed so that we can run queries on
> merged data.

That's one way to do it. You can also maintain views across multiple
dbs, and merge them at query time. It's a little tougher to do on a
thin client.

Once couchdb-lounge style functionality arrives in CouchDB you'll be
able to merge view queries from multiple databases on the server,
which could help here.

>
> user/public - publish documents go here (and erased from user/private),
>> replicated to user/feed and the firehose on the server
>> user/feed - user/public and user/private replicated here (merged), general
>> queries happen here
>>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Question about document copies & replication

Posted by David Nolen <dn...@gmail.com>.
On Wed, Nov 11, 2009 at 1:42 PM, David Nolen <dn...@gmail.com> wrote:

> user/inbox - replicated from server to client
> user/private - new and unpublished documents go here
>

user/private also replicated to user/feed so that we can run queries on
merged data.

> user/public - publish documents go here (and erased from user/private),
> replicated to user/feed and the firehose on the server
> user/feed - user/public and user/private replicated here (merged), general
> queries happen here
>

Re: Question about document copies & replication

Posted by David Nolen <dn...@gmail.com>.
Many thanks for the informative reply!

On Wed, Nov 11, 2009 at 12:52 AM, Chris Anderson <jc...@apache.org> wrote:
>
> To publish a message, the user saves a document to a "publish"
> database. At replication time this is pushed to the global firehose.
>

Yup.


> Users pull from the firehose(s) using filtered replication to only
> copy docs authored by people they're interested in.
>

This isn't available yet, right?


> The user can also maintain an inbox in the cloud. Eg the host of the
> global firehose (or other hosts) can maintain a database that people
> can write to but not read from. Then as a user I can send direct
> messages to other people's inboxes, which they will see at replication
> time.
>

Yes, user/inbox is replicated from the server.


> I'd make the private database its own db, and make another db for
> replicated content. I'm thinking the private data is gonna be
> important things like medical records and stuff, so I don't want to
> just mix it with everything else.
>

Good idea.


> This is why the user should save to the publish db, (eg instead of the
> "drafts" db), and let replication send the publish db to the firehose.
> Then it becomes clear that the private db is only for data I want to
> avoid replicating except very carefully.
>

Now that you talk about this approach I think something like the following
is better until filtered replication or replicating only certain documents
lands:

user/inbox - replicated from server to client
user/private - new and unpublished documents go here
user/public - published documents go here (and are erased from user/private),
replicated to user/feed and the firehose on the server
user/feed - user/public and user/private replicated here (merged), general
queries happen here
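
The layout above can be written down as a list of replication pairs, which makes it easy to check that user/private never reaches the server. This is only a sketch: the "server:" prefix is an invented notation for databases on the remote CouchDB, and reachesServer is a hypothetical helper.

```javascript
// One entry per replication in the layout; "server:" marks a remote database
// (a notation made up for this sketch).
const replications = [
  { source: "server:user/inbox", target: "user/inbox" },
  { source: "user/private", target: "user/feed" },
  { source: "user/public", target: "user/feed" },
  { source: "user/public", target: "server:firehose" },
];

// True if documents in `db` can end up on the server, directly or through
// a chain of replications. `seen` guards against cycles in the plan.
function reachesServer(plan, db, seen = new Set()) {
  if (seen.has(db)) return false;
  seen.add(db);
  return plan
    .filter((r) => r.source === db)
    .some((r) => r.target.startsWith("server:") || reachesServer(plan, r.target, seen));
}

console.log("user/private leaves the laptop:", reachesServer(replications, "user/private"));
console.log("user/public leaves the laptop:", reachesServer(replications, "user/public"));
```

Note that database names containing a slash have to be URL-encoded (user%2Fprivate) when they appear in a CouchDB URL.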


> Technically you can do what you're shooting for, but it might be
> better to use replication instead of saving to multiple dbs. I've been
> thinking the replicator deserves an option to specify an array of
> docids to replicate, which could be useful in this application.
>

This would be awesome.

Thanks again.

> Glad to help,
>
> Chris
>
> > Thanks much,
> > David
> >
>
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
>

Re: Question about document copies & replication

Posted by Chris Anderson <jc...@apache.org>.
On Tue, Nov 10, 2009 at 8:19 PM, David Nolen <dn...@gmail.com> wrote:
> Ok,
>
> My mind is being blown by CouchDB :D
>
> So I've realized that having a few databases per user is a really great idea
> if you decide to scale by decentralizing your content (clients do the heavy
> lifting by running queries on their local couchdb instances - since you're
> replicating to the client you don't really care that a lot of data is
> getting copied around). Every user has their own view of the world, and the
> server CouchDB instance is really only for dealing with content shared
> between users.

David,

I've definitely considered this mode of operation. I think the
simplest way to model a Twitter-like message distribution is to have
multiple databases for each user as well as a global firehose.

Let's assume the user is browsing against a CouchDB running on their laptop.

To publish a message, the user saves a document to a "publish"
database. At replication time this is pushed to the global firehose.

Users pull from the firehose(s) using filtered replication to only
copy docs authored by people they're interested in.
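
As a sketch of what such a filter could look like: later CouchDB releases added filter functions, stored in a design document and named in the replication request, that are called with (doc, req) and return true for documents that should be copied. The author field on documents and the authors query parameter are assumptions made for this example.

```javascript
// Replication filter sketch: keep only docs written by the listed authors.
// `doc.author` and `req.query.authors` (comma-separated) are assumed names.
// Written in ES5 style because CouchDB's JavaScript engine may be an old one.
function byAuthor(doc, req) {
  var wanted = (req.query.authors || "").split(",");
  return !!doc.author && wanted.indexOf(doc.author) !== -1;
}

// Local check with hand-made arguments.
var req = { query: { authors: "david,chris" } };
console.log(byAuthor({ _id: "m1", author: "david" }, req));
console.log(byAuthor({ _id: "m2", author: "someone-else" }, req));
```

Documents without an author field are never replicated by this filter, which may or may not be what an application wants.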

The user can also maintain an inbox in the cloud. E.g., the host of the
global firehose (or other hosts) can maintain a database that people
can write to but not read from. Then as a user I can send direct
messages to other people's inboxes, which they will see at replication
time.

The user can also maintain a public replication of the "publish"
database (unmerged from the firehose), which could include just
remarks written by the user, but optionally include other messages the
user has seen and saved in the publish db.

>
> In our application (http://shiftspace.org), I'm thinking something along
> these lines:
>
> Client's laptop CouchDB instance:
> user/private - all the documents a user has created plus replicated content
> from groups and whatnot
> user/public - all the user's public content
> user/inbox - short messages
>

Your layout makes sense.

I'd make the private database its own db, and make another db for
replicated content. I'm thinking the private data is gonna be
important things like medical records and stuff, so I don't want to
just mix it with everything else.

You also want to make sure there is a publicly write-able inbox
database for each user in the cloud, so users can send each other
direct messages.

> Server CouchDB instance:
> group/x - group dbs of shared content. Replicated downstream to individual
> user/private who belong to the group
> master - user/public dbs replicated upstream to here.
>
> So my question is this. When a user publishes a document, it is written to
> user/private. If the user publishes a document to the world, we make a copy
> of it in user/public - it's just the same data minus the _rev field.
> Whenever a user updates a public document, we update the user/private copy
> as well as the user/public copy which will be replicated upstream to the
> server.

This is why the user should save to the publish db, (eg instead of the
"drafts" db), and let replication send the publish db to the firehose.
Then it becomes clear that the private db is only for data I want to
avoid replicating except very carefully.

>
> So my question for the CouchDB gurus, will creating copies of documents in
> this manner create potential problems?
>

Technically you can do what you're shooting for, but it might be
better to use replication instead of saving to multiple dbs. I've been
thinking the replicator deserves an option to specify an array of
docids to replicate, which could be useful in this application.
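
For reference, later CouchDB releases did gain this kind of option: the body of a POST to /_replicate can carry a doc_ids array. It was not available when this thread was written. The sketch below only builds the request body; the helper name, the server URL and the document ids are hypothetical.

```javascript
// Build the JSON body for a POST to /_replicate that copies only the
// listed documents. `doc_ids` is the field name used by later CouchDB
// releases; everything else here is illustrative.
function replicationRequest(source, target, docIds) {
  return { source: source, target: target, doc_ids: docIds };
}

const body = replicationRequest(
  "user/private",
  "http://example.org/master", // hypothetical server database
  ["shift-1", "shift-2"]
);
console.log(JSON.stringify(body));
```

With an option like this, publishing could mean replicating a chosen set of ids from the private database rather than saving a second copy by hand.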

Glad to help,

Chris

> Thanks much,
> David
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io