Posted to user@couchdb.apache.org by Joscha Feth <jo...@feth.com> on 2010/01/05 18:26:34 UTC

Modeling question

Hello List,

the following scenario:

CouchDB is used as storage for hierarchical data in three levels:
* There are A's which are root elements
* There are B's which belong to A's. Every B stores the id of its (one and
only) parent A.
* There are C's which belong to B's. Every C stores the ids of its (one and
only) parent B and parent A.

All of them (A's, B's and C's) may contain a simplified access list with
one right per user: 0 = the user is not allowed to access the document
(the default), 1 = the user can read, 2 = the user can write.
If a user has read access to an A, he can also read all B's, and
therefore also all C's, belonging to that specific A.
A user who can read an A may also have additional rights on a specific
sub-element (either a B or a C).
Users want to read and change not only A's but also B's and C's, without
necessarily knowing the parent A or B before the request.
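
The access rule described above could be sketched roughly as follows. This
is only an illustration: the field names (type, parent_a, parent_b, acl),
the in-memory store and the assumption that the highest right found on a
document or any of its ancestors wins are all made up for the example and
are not part of the original post.

```javascript
// Hypothetical documents; field names are assumptions for illustration.
var docs = {
  a1: { _id: "a1", type: "A", acl: { alice: 1 } },
  b1: { _id: "b1", type: "B", parent_a: "a1" },
  c1: { _id: "c1", type: "C", parent_a: "a1", parent_b: "b1", acl: { alice: 2 } },
  c2: { _id: "c2", type: "C", parent_a: "a1", parent_b: "b1" }
};

// Rights: 0 = no access (default), 1 = read, 2 = write.
function ownRight(doc, user) {
  return (doc && doc.acl && doc.acl[user]) || 0;
}

// Assumed rule: the effective right is the highest right found on the
// document itself or on any of its ancestors.
function effectiveRight(store, id, user) {
  var doc = store[id];
  if (!doc) return 0;
  var chain = [doc, store[doc.parent_b], store[doc.parent_a]];
  return Math.max.apply(null, chain.map(function (d) {
    return ownRight(d, user);
  }));
}

console.log(effectiveRight(docs, "c1", "alice")); // 2 (own right on c1)
console.log(effectiveRight(docs, "c2", "alice")); // 1 (inherited from a1)
console.log(effectiveRight(docs, "c2", "bob"));   // 0 (default)
```

The difficulty discussed in the rest of the thread is that CouchDB itself
has no place to run such a function across several documents in one
request.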

Now my question is: how can I enforce this type of access control?
Several possibilities come to mind, but each has a major drawback. I hope
some of you have an idea.

Possibility A)
I create a view which contains all documents and a list function which
runs through the requested elements, checking whether the user has the
access he wants. The drawback is that I need to loop over *all* elements
within CouchDB on *every* request until I find the requested element,
even if the user only wants to read a single A, B or C.
This sounds very bad from a performance perspective.

Possibility B)
I put some sort of middleware between the application and CouchDB, which
fetches the requested document, reads its parents from CouchDB in a
subsequent request and then decides whether to return the fetched
document (or apply the change). The drawback is the additional request,
plus the need either to buffer the first response until the decision is
made or to make a third, piped request once the user turns out to have
the necessary rights.
What I don't like here is that not only writes get slower, but reads as
well: every read from CouchDB triples (writes double).

regards,
Joscha


Re: Modeling question

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jan 07, 2010 at 09:29:49PM +0100, Joscha Feth wrote:
> > I've tried it out and added some documentation here:
> > http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents
> 
> this is exactly what I was searching for...cool, really really cool,
> this is a bit like the parent relationships in Bigtable - does anyone
> know since when this exists in trunk?

commit 333e010218a6cb9681672d3f6a52923a65c353b7
Author: jchris <jc...@13f79535-47bb-0310-9956-ffa450edef68>
Date:   Wed Sep 16 22:04:18 2009 +0000

    include_docs now take an _id (as well as a _rev) in the emitted value, to load docs other than the one doing the emitting. This means you can have one doc list a set of other docs to load in a single query. Enjoy!
    
    
    git-svn-id: http://svn.apache.org/repos/asf/couchdb/trunk@815984 13f79535-47bb-0310-9956-ffa450edef68

Re: Modeling question

Posted by Chris Anderson <jc...@apache.org>.
On Thu, Jan 7, 2010 at 12:29 PM, Joscha Feth <jo...@feth.com> wrote:
> On 07.01.10 11:27, Brian Candler wrote:
>> On Thu, Jan 07, 2010 at 09:45:33AM +0000, Brian Candler wrote:
>>> (*) There is a new feature in CouchDB, which I haven't tried, which allows a
>>> view to emit different documents than the one being processed.
>>
>> I've tried it out and added some documentation here:
>> http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents
>
> this is exactly what I was searching for...cool, really really cool,
> this is a bit like the parent relationships in Bigtable - does anyone
> know since when this exists in trunk?

This will be new in 0.11

>
> regards,
> Joscha
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Modeling question

Posted by Joscha Feth <jo...@feth.com>.
On 07.01.10 11:27, Brian Candler wrote:
> On Thu, Jan 07, 2010 at 09:45:33AM +0000, Brian Candler wrote:
>> (*) There is a new feature in CouchDB, which I haven't tried, which allows a
>> view to emit different documents than the one being processed.
> 
> I've tried it out and added some documentation here:
> http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents

This is exactly what I was searching for. Really cool; it is a bit like
the parent relationships in Bigtable. Does anyone know since when this
has existed in trunk?

regards,
Joscha


Re: Modeling question

Posted by Brian Candler <B....@pobox.com>.
On Thu, Jan 07, 2010 at 09:45:33AM +0000, Brian Candler wrote:
> (*) There is a new feature in CouchDB, which I haven't tried, which allows a
> view to emit different documents than the one being processed.

I've tried it out and added some documentation here:
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents

It looks like it could be useful for you.

As far as I can tell, it only affects include_docs=true. That is, you still
cannot include data from a different document directly into the keys or
values of a view row, but you can just cause a different document to be
loaded when include_docs=true is used.
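
A minimal sketch of such a map function, runnable outside CouchDB. The
emit() stand-in, the document fields (type, parent_a, parent_b) and the
[id, n] key layout are assumptions for illustration; only the {_id: ...}
value convention comes from the feature described above.

```javascript
// Stand-in for CouchDB's emit(), so the map function can run outside the
// view server.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// For each C, emit one row for C itself and one row per parent.
// parent_a and parent_b are assumed field names.
function map(doc) {
  if (doc.type === "C") {
    emit([doc._id, 0], null);                  // include_docs loads C itself
    emit([doc._id, 1], { _id: doc.parent_b }); // include_docs loads B instead
    emit([doc._id, 2], { _id: doc.parent_a }); // include_docs loads A instead
  }
}

map({ _id: "c2", type: "C", parent_b: "b2", parent_a: "a" });
console.log(JSON.stringify(rows));
```

Queried with startkey=["c2"], endkey=["c2",{}] and include_docs=true, the
three rows should come back next to each other, each carrying a different
document (C2, B2 and A) in its doc member.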

Regards,

Brian.

Re: Modeling question

Posted by Brian Candler <B....@pobox.com>.
On Tue, Jan 05, 2010 at 07:44:51PM +0100, Joscha Feth wrote:
> >If you store the full path (A, B) on the Cs as well, then you can collate like:
> >
> >A,
> >A, B
> >A, B, C
> >A, B2
> >A, B2, C
> >A, B2, C2
> >
> >etc.
> >
> >If you are paginating through large directories, then you'll want to
> >store the parent metadata in client state, so you don't have to reload
> >it for each page.
> 
> I don't see how that can help when a user requests a C without
> knowing that it is a C and has a parent B and A.

You could emit the 'reverse' path in the view, so when you search for C2
you'd actually query the key range [C2] -> [C2,{},{}] and you would find

[C2, B2, A]

But if C2 itself denies access, then it would still be up to your
application to determine whether the user has access to B2 and/or to A, which
might in turn permit them to access C2, and that would involve further queries.
(*)
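
A rough sketch of the reverse-path idea, runnable outside CouchDB. The
emit() stand-in, the field names (type, parent_a, parent_b, acl) and the
lookup() helper are assumptions for illustration; lookup() only imitates
what a startkey=[id], endkey=[id,{},{}] range query would return.

```javascript
// Stand-in for CouchDB's emit().
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Emit the path in reverse: the document's own id first, then its parents.
function map(doc) {
  if (doc.type === "A") emit([doc._id], doc.acl || {});
  if (doc.type === "B") emit([doc._id, doc.parent_a], doc.acl || {});
  if (doc.type === "C") emit([doc._id, doc.parent_b, doc.parent_a], doc.acl || {});
}

[
  { _id: "A", type: "A", acl: { alice: 1 } },
  { _id: "B2", type: "B", parent_a: "A" },
  { _id: "C2", type: "C", parent_b: "B2", parent_a: "A" }
].forEach(function (doc) { map(doc); });

// Imitates the range query: every row whose key starts with the given id.
function lookup(id) {
  return rows.filter(function (row) { return row.key[0] === id; });
}

console.log(JSON.stringify(lookup("C2")[0].key)); // ["C2","B2","A"]
```

The caller does not need to know whether the id belongs to an A, a B or a
C; the length of the returned key reveals the level, and the rest of the
key names the parents to check next.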

I think if you want to *enforce* this sort of access control, you're bound
to have middleware.  Otherwise, if you're storing the ACL for C2 within
document C2, then you have to trust the client to read C2, check the ACL,
find out that they're not allowed to read the rest of the document, and
ignore the rest of it.

Another option is: would it be possible to combine A, all its child B's, and
all their child C's into the same document? You can then use a view to spit
out the keys and access rights for each component. (i.e. a view which walks
the document and emits multiple k,v pairs)
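
A sketch of what such a view could look like, runnable outside CouchDB.
The nested layout (bs, cs, id, acl), the emit() stand-in and the merge
rule (the highest right wins, as in a simple inheritance model) are
assumptions for illustration, not something the thread specifies.

```javascript
// Stand-in for CouchDB's emit().
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Combine inherited rights with a component's own rights; highest wins.
function merge(inherited, own) {
  var out = {};
  var user;
  own = own || {};
  for (user in inherited) out[user] = inherited[user];
  for (user in own) out[user] = Math.max(out[user] || 0, own[user]);
  return out;
}

// Assumed layout: an A document nests its B's under `bs`, and each B
// nests its C's under `cs`. One row is emitted per component.
function map(doc) {
  if (doc.type !== "A") return;
  var aAcl = merge({}, doc.acl);
  emit(doc._id, aAcl);
  (doc.bs || []).forEach(function (b) {
    var bAcl = merge(aAcl, b.acl);
    emit(b.id, bAcl);
    (b.cs || []).forEach(function (c) {
      emit(c.id, merge(bAcl, c.acl));
    });
  });
}

map({
  _id: "a1", type: "A", acl: { alice: 1, bob: 2 },
  bs: [{ id: "b1", cs: [{ id: "c1", acl: { alice: 2 } }] }]
});
console.log(JSON.stringify(rows));
```

Because the whole tree lives in one document, the map function can see
the parents' rights while it walks the children, which is exactly what
separate A, B and C documents do not allow.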

I'm assuming here that you're going to block direct access to _view (e.g. in
an HTTP proxy), and force all access through _list. Otherwise a user will be
able to look at the whole key range, and use ?include_docs=true to view the
entire source document.

> I cannot use a database per user or group, as both are completely
> dynamic (e.g. a user may gain or lose access to some documents at any
> time).

Fine-grained read access control is not provided within CouchDB, not per
document and certainly not for data within views.  Middleware is your best
bet, and _list is the closest that CouchDB provides to middleware if you
want to avoid a separate tier.

HTH,

Brian.

(*) There is a new feature in CouchDB, which I haven't tried, which allows a
view to emit different documents than the one being processed. That is, if a
document C2 'points' in some way to the docids for B2 and A, then when
mapping document C2 you can also emit docs B2 and A. This means that the
parent documents can be made adjacent to the target document in the view,
so your list function has the data to hand to compute the union of the
access rights. At least, I think this is possible.

Looking at
share/www/script/test/view_include_docs.js 
I think you do something like
emit(key, {'_id': 'B'})
emit(key, {'_id': 'A'})

Re: Modeling question

Posted by Joscha Feth <jo...@feth.com>.
On 05.01.10 19:52, Chris Anderson wrote:
>> I don't see how that can help when a user requests a C without knowing that
>> it is a C and has a parent B and A. I still need to either allow access or
>> deny it. I don't see how something like this is possible with CouchDB out of
>> the box, as I don't have the ability to make another request.
>
> This is why modeling as a key/value store will be simpler. If you
> store the permissions on C itself, A and B become irrelevant.

Yes, but that would mean that I cannot have any inheritance from A->C,
and it would also mean that if someone changes access on A or B I need
to propagate that change to possibly millions of C's to keep them in sync.

regards,
Joscha


Re: Modeling question

Posted by Chris Anderson <jc...@apache.org>.
On Tue, Jan 5, 2010 at 10:44 AM, Joscha Feth <jo...@feth.com> wrote:
> Hi Chris,
>
> On 05.01.10 19:32, Chris Anderson wrote:
>>
>> If you store the full path (A, B) on the Cs as well, then you can collate
>> like:
>>
>> A,
>> A, B
>> A, B, C
>> A, B2
>> A, B2, C
>> A, B2, C2
>>
>> etc.
>>
>> If you are paginating through large directories, then you'll want to
>> store the parent metadata in client state, so you don't have to reload
>> it for each page.
>
> I don't see how that can help when a user requests a C without knowing that
> it is a C and has a parent B and A. I still need to either allow access or
> deny it. I don't see how something like this is possible with CouchDB out of
> the box, as I don't have the ability to make another request.

This is why modeling as a key/value store will be simpler. If you
store the permissions on C itself, A and B become irrelevant.

>
>
>> Without knowing more about the actual requirements, I can't argue this
>> point, but I've seen my share of cases that were simplified greatly by
>> moving to a key-value model. For instance, S3 has a per-bucket
>> security model, which is very similar to CouchDB's planned
>> per-database read control model.
>>
>> If you use a database-per-user (or group), quota calculations will
>> become much easier (and more realistic, as they account for actual
>> disk space used, not just total stored file size). You also get a
>> bunch of other simplifications (like the ability to replicate data
>> independently, also you don't have to worry so much about fine-grained
>> access control.) CouchDB can handle millions of databases on a single
>> node.
>
> I cannot use a database per user or group, as both are completely dynamic
> (e.g. a user may gain or lose access to some documents at any time). But I
> could use a separate database for each A (storing all B's and C's within
> it) - would that help? I can't see how currently...
>
> regards,
> Joscha
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Modeling question

Posted by Joscha Feth <jo...@feth.com>.
Hi Chris,

On 05.01.10 19:32, Chris Anderson wrote:
> If you store the full path (A, B) on the Cs as well, then you can collate like:
>
> A,
> A, B
> A, B, C
> A, B2
> A, B2, C
> A, B2, C2
>
> etc.
>
> If you are paginating through large directories, then you'll want to
> store the parent metadata in client state, so you don't have to reload
> it for each page.

I don't see how that can help when a user requests a C without knowing 
that it is a C and has a parent B and A. I still need to either allow 
access or deny it. I don't see how something like this is possible with 
CouchDB out of the box, as I don't have the ability to make another request.


> Without knowing more about the actual requirements, I can't argue this
> point, but I've seen my share of cases that were simplified greatly by
> moving to a key-value model. For instance, S3 has a per-bucket
> security model, which is very similar to CouchDB's planned
> per-database read control model.
>
> If you use a database-per-user (or group), quota calculations will
> become much easier (and more realistic, as they account for actual
> disk space used, not just total stored file size). You also get a
> bunch of other simplifications (like the ability to replicate data
> independently, also you don't have to worry so much about fine-grained
> access control.) CouchDB can handle millions of databases on a single
> node.

I cannot use a database per user or group, as both are completely
dynamic (e.g. a user may gain or lose access to some documents at any
time). But I could use a separate database for each A (storing all
B's and C's within it) - would that help? I can't see how currently...

regards,
Joscha


Re: Modeling question

Posted by Chris Anderson <jc...@apache.org>.
On Tue, Jan 5, 2010 at 10:01 AM, Joscha Feth <jo...@feth.com> wrote:
> On 05.01.10 18:49, Chris Anderson wrote:
>
>> if you collate correctly and use startkey and endkey you will be
>> able to read just the relevant rows from the view. This should be
>> practically instantaneous, so it's probably the right solution.
>
> Actually this is only true for A's (I do this already), as a B or C does
> not necessarily have its own access field.
> As it is not possible to temporarily store variables in a map function, there
> is no way I can propagate the ACL from a parent element (such as A or B)
> into a child element (B or C).
> I could probably misuse reduce, but as this is discouraged, I did not even
> try.
> Also, an A does not have any information about its children, so when emitting
> an A (which will most likely contain the access information), I cannot also
> emit that information for any child (B or C).
> This makes startkey and endkey suitable only for A's, not for any
> child element in the hierarchy.
>

If you store the full path (A, B) on the Cs as well, then you can collate like:

A,
A, B
A, B, C
A, B2
A, B2, C
A, B2, C2

etc.

If you are paginating through large directories, then you'll want to
store the parent metadata in client state, so you don't have to reload
it for each page.
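
A sketch of the path-based collation above, runnable outside CouchDB. The
emit() stand-in, the field names (type, parent_a, parent_b, acl) and both
helpers are assumptions for illustration. compareKeys() is a simplified
comparison for arrays of plain ASCII strings only; CouchDB's real
collation covers more types and uses ICU string ordering.

```javascript
// Stand-in for CouchDB's emit().
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Assumed fields: every B stores parent_a, every C stores parent_a and
// parent_b, so each document can emit its full path as the key.
function map(doc) {
  if (doc.type === "A") emit([doc._id], doc.acl || null);
  if (doc.type === "B") emit([doc.parent_a, doc._id], doc.acl || null);
  if (doc.type === "C") emit([doc.parent_a, doc.parent_b, doc._id], doc.acl || null);
}

// Simplified collation: element by element, shorter prefix first.
function compareKeys(x, y) {
  for (var i = 0; i < Math.min(x.length, y.length); i++) {
    if (x[i] !== y[i]) return x[i] < y[i] ? -1 : 1;
  }
  return x.length - y.length;
}

[
  { _id: "c3", type: "C", parent_a: "a", parent_b: "b2" },
  { _id: "b1", type: "B", parent_a: "a" },
  { _id: "a", type: "A", acl: { alice: 1 } },
  { _id: "c1", type: "C", parent_a: "a", parent_b: "b1" },
  { _id: "b2", type: "B", parent_a: "a" },
  { _id: "c2", type: "C", parent_a: "a", parent_b: "b2" }
].forEach(function (doc) { map(doc); });
rows.sort(function (p, q) { return compareKeys(p.key, q.key); });

// Imitates startkey=prefix, endkey=prefix+[{}]: the node and its subtree.
function subtree(prefix) {
  return rows.filter(function (row) {
    return compareKeys(row.key.slice(0, prefix.length), prefix) === 0;
  });
}

console.log(rows.map(function (r) { return r.key.join("/"); }).join(" "));
// a a/b1 a/b1/c1 a/b2 a/b2/c2 a/b2/c3
```

Each parent sorts directly before its children, so a single range read
returns a node together with everything below it.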

> Any other ideas?
>
>> I mentioned before, but I'll mention again - if there's a way to
>> achieve your business case without modeling a hierarchy (always an
>> impedance mismatch with key value stores) you will simplify a lot of
>> things. Not that it's impossible to do a hierarchy, but if you can get
>> away without it, you'll have a lot less on your plate.
>
> I've read that multiple times, but you know there are these times when a
> hierarchy just is needed.

Without knowing more about the actual requirements, I can't argue this
point, but I've seen my share of cases that were simplified greatly by
moving to a key-value model. For instance, S3 has a per-bucket
security model, which is very similar to CouchDB's planned
per-database read control model.

If you use a database-per-user (or group), quota calculations will
become much easier (and more realistic, as they account for actual
disk space used, not just total stored file size). You also get a
bunch of other simplifications (like the ability to replicate data
independently, also you don't have to worry so much about fine-grained
access control.) CouchDB can handle millions of databases on a single
node.

Chris

>
> regards,
> Joscha
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: Modeling question

Posted by Joscha Feth <jo...@feth.com>.
On 05.01.10 18:49, Chris Anderson wrote:

> if you collate correctly and use startkey and endkey you will be
> able to read just the relevant rows from the view. This should be
> practically instantaneous, so it's probably the right solution.

Actually this is only true for A's (I do this already), as a B or C
does not necessarily have its own access field.
As it is not possible to temporarily store variables in a map function,
there is no way I can propagate the ACL from a parent element (such as
A or B) into a child element (B or C).
I could probably misuse reduce, but as this is discouraged, I did not
even try.
Also, an A does not have any information about its children, so when
emitting an A (which will most likely contain the access information), I
cannot also emit that information for any child (B or C).
This makes startkey and endkey suitable only for A's, not for
any child element in the hierarchy.

Any other ideas?

> I mentioned before, but I'll mention again - if there's a way to
> achieve your business case without modeling a hierarchy (always an
> impedance mismatch with key value stores) you will simplify a lot of
> things. Not that it's impossible to do a hierarchy, but if you can get
> away without it, you'll have a lot less on your plate.

I've read that multiple times, but you know there are these times when a 
hierarchy just is needed.

regards,
Joscha


Re: Modeling question

Posted by Chris Anderson <jc...@apache.org>.
On Tue, Jan 5, 2010 at 9:26 AM, Joscha Feth <jo...@feth.com> wrote:
> Hello List,
>
> the following scenario:
>
> CouchDB is used as storage for hierarchical data in three levels:
> * There are A's which are root elements
> * There are B's which belong to A's. Any B knows (id) its (one and only)
> parent A.
> * There are C's which belong to B's. Any C knows (id) its (one and only
> parent B and parent A).
>
> All of them (A's, B's and C's) may contain a simplified access list (user
> can read (1), user can write(2), user is not allowed to access the document
> (0), default).
> Now if a user has read access to an A, he can also read all B's and
> therefore also all C's belonging to that specific A.
> Now it could be the case that a user can read A, but has additional rights
> on a specific sub-element (may be either a B or a C).
> Users want to not only read/change A's only, but also B's and C's, not
> necessarily knowing the parenting A or B before the request.
>
> Now my question is: how can I enforce that type of access control.
> There are multiple possibilities which come on my mind, but all have a major
> drawback - I hope some of you guys have an idea.
>
> Possibility A)
> I create a view which contains all documents and create a list, which runs
> through all requested elements, checking whether the user has the access he
> wants. Drawback of this is, that I need to loop over *all* elements within
> CouchDB on *every* request, even if the user only wants to read a certain A,
> B or C until I found the requested element.
> This possibility actually sounds very bad from a performance perspective.
>

If you collate correctly and use startkey and endkey you will be
able to read just the relevant rows from the view. This should be
practically instantaneous, so it's probably the right solution.

I mentioned before, but I'll mention again - if there's a way to
achieve your business case without modeling a hierarchy (always an
impedance mismatch with key value stores) you will simplify a lot of
things. Not that it's impossible to do a hierarchy, but if you can get
away without it, you'll have a lot less on your plate.

> Possibility B)
> I put some sort of middleware between the application and couchDB, which
> fetches the requested document, reads the belonging parents from CouchDB in
> a subsequent request and then decides whether to return the fetched
> document/do the change or not. Drawback here is the additional request and
> the need for buffering the first request until it can be decided whether to
> return it or not OR to make a third piped request when the user has the
> according rights.
> What I don't like here is, that not actually only writes are getting slower,
> but also the reads. And all reads to CouchDB triple (writes double).
>
> regards,
> Joscha
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io