Posted to dev@couchdb.apache.org by Jan Lehnardt <ja...@apache.org> on 2019/03/10 14:51:14 UTC

Re: [DISCUSS] Per-doc access control

Hey all,

after mulling this over some more, I’d like to tackle the detailed API and behaviour for this, especially how _access works in conjunction with existing access control features.

My guiding principles so far are:

1. Make the API intuitive: things should work the way they look like they should work.
2. The default should never be that a resource is accidentally left accessible to the public.
3. This should work as a natural extension to the existing security features*.

* I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.

## Database Creation and Default Behaviours

Creating a database with _access features is, as mentioned before, done via a flag: PUT /database?access=true

In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable as they are today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone but server or db admins. Not very useful.

To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.

In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
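
As a rough illustration of the above, here is a minimal Python sketch of the _security check. The `_users` role is the new role proposed in this mail and does not exist in CouchDB today; the function name `may_use_db` and the plain-dict shape of `user_ctx` are assumptions made for the example, and an empty _security is treated as admin-only, per the 3.0 default discussed above.

```python
# Hypothetical sketch: does a user pass the _security check of a database?
# The "_users" role is the proposed new role, not an existing CouchDB feature.

def may_use_db(user_ctx, security):
    """Return True if the user is listed in admins or members of _security."""
    name = user_ctx.get("name")
    roles = set(user_ctx.get("roles", []))
    if "_admin" in roles:
        return True  # server admins have full access anyway
    if name is not None:
        # Proposed: every authenticated user from /_users implicitly has "_users".
        roles.add("_users")
    for section in ("admins", "members"):
        entry = security.get(section, {})
        if name is not None and name in entry.get("names", []):
            return True
        if roles & set(entry.get("roles", [])):
            return True
    return False


security = {"admins": {"names": [], "roles": []},
            "members": {"names": [], "roles": ["_users"]}}
print(may_use_db({"name": "user_a", "roles": []}, security))  # True
print(may_use_db({"name": None, "roles": []}, security))      # False
```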

* * *

## Document Creation and Access Control

Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.

1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.

2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.

This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user cannot create a new document without an _access field that is an array with at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.

The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well. We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.

For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.

For both creations and updates, a user MUST supply at least one role they belong to or their own username.
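
The rules in this section could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `validate_write` and `identities` are hypothetical names, `_access` is assumed to be a flat list of strings, and the admin special case is the optional one floated above, not a settled decision.

```python
# Hypothetical sketch of the _access rules for document writes.

def identities(user_ctx):
    """The user's name plus all of their roles."""
    ids = set(user_ctx.get("roles", []))
    if user_ctx.get("name"):
        ids.add(user_ctx["name"])
    return ids


def validate_write(new_doc, old_access, user_ctx):
    """Return the effective _access list, or raise on a rejected write.

    old_access is None for a creation, else the stored doc's _access list.
    """
    ids = identities(user_ctx)
    is_admin = "_admin" in ids

    # Updates and deletes: the stored doc must already grant access.
    if old_access is not None and not is_admin and not ids & set(old_access):
        raise PermissionError("unauthorized")

    access = new_doc.get("_access")
    if access is None and is_admin:
        # Optional special case: admin-written docs without _access
        # are treated as admin-only.
        return ["_admin"]
    if not isinstance(access, list) or len(access) == 0:
        raise ValueError("_access must be a non-empty array")
    if not is_admin and not ids & set(access):
        raise PermissionError("_access must name the user or one of their roles")
    return access
```

With this, user_a creating {"a":1,"_access":["user_a"]} is accepted, while the same doc without _access is rejected unless an admin writes it.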

* * *

## _revs_diff

/db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff

This would allow users to specify ids and rev(s) for docs they don’t have access to (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know what to do with if it encounters it (say a user got removed from the _access list in between the replicator opening _changes and requesting the doc).

The _revs_diff implementation would have to be altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
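
To make the expanded result schema a bit more concrete, here is a small Python sketch. The {"error": "unauthorized"} shape is purely an assumption for illustration (the actual token is still open), and the in-memory `stored` dict stands in for whatever index the real implementation would consult.

```python
# Hypothetical sketch of an access-aware _revs_diff.
# request: {doc_id: [rev, ...]}
# stored:  {doc_id: {"revs": [rev, ...], "access": [name_or_role, ...]}}

def revs_diff(request, stored, user_ctx):
    ids = set(user_ctx.get("roles", []))
    if user_ctx.get("name"):
        ids.add(user_ctx["name"])

    result = {}
    for doc_id, revs in request.items():
        doc = stored.get(doc_id)
        if doc is None:
            # Unknown doc: every requested revision is missing.
            result[doc_id] = {"missing": list(revs)}
            continue
        if "_admin" not in ids and not ids & set(doc["access"]):
            # Assumed token; the replicator would need to handle this.
            result[doc_id] = {"error": "unauthorized"}
            continue
        missing = [rev for rev in revs if rev not in doc["revs"]]
        if missing:
            result[doc_id] = {"missing": missing}
    return result
```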

An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.

* * *

## Partitioned Databases

I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
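
For illustration, opting into both could look like the following Python sketch. It assumes partitioned doc ids take the form partition:docid, as in the partitioned databases work; the helper name `partitioned_doc` and the restriction on ":" in user names are assumptions of the sketch.

```python
# Hypothetical sketch: build a doc that is both partitioned by user
# and readable only by that user.

def partitioned_doc(user_name, local_id, body):
    if ":" in user_name:
        # Keep the partition key unambiguous in this sketch.
        raise ValueError("user name must not contain ':'")
    doc = dict(body)  # do not mutate the caller's dict
    doc["_id"] = f"{user_name}:{local_id}"
    doc["_access"] = [user_name]
    return doc


print(partitioned_doc("user_a", "todo-1", {"title": "buy milk"}))
```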

* * *

## Scenarios that _access should help with.

Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.

### Scenario 1: db-per-user

In this scenario, before _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.

In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.

This is the simplest scenario, as all we’d have to do is track the owner of a document and produce by-access-id/seq indexes based on that owner.

The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest to implement and explain.

Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.

### Scenario 2: db-per-user + Sharing

The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:

#### 2.1: The Todo List

In this scenario, a user has a reasonable amount of “personal data” that they want to selectively share with one or more other users.

#### 2.2: The Chat/Forum/Newsgroup

In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.

#### 2.3: The Corporate Hierarchy

In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.

### 3: Multiple Apps

The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?

I’ll be referring to these scenarios down the line.

* * *

## Design Docs

### Admin

One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).

N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
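
A toy Python sketch of what such an index signature could look like. This is not how CouchDB computes view signatures today (that works on Erlang terms); the JSON serialisation, the choice of SHA-256 and the name `index_signature` are all assumptions, meant only to show that a change in roles yields a different index.

```python
import hashlib
import json

# Hypothetical sketch: derive a per-user view index signature from the
# design doc's `views` member plus the requesting user's name and roles.

def index_signature(views, user_name, roles):
    payload = json.dumps(
        {"views": views, "name": user_name, "roles": sorted(set(roles))},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


views = {"by_title": {"map": "function (doc) { emit(doc.title, null); }"}}
before = index_signature(views, "user_a", ["team_x"])
after = index_signature(views, "user_a", ["team_x", "team_y"])
print(before != after)  # True: gaining a group invalidates the old index
```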

### User

In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?

The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).

The easiest to handle is `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to if they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.

Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.

More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs, how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). But especially in scenario 3. this becomes problematic, but even without that specific scenario, this violates the no surprises best practice.

We could say:

a) we don’t support scenario 3.
b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.

I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.

In practice this would mean, a document that has an _id that starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.

## Read and Write Access

Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.

Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).

However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think it would be a reasonable expectation for users to expect a per-doc read/write permission granting.

So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []} or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.

The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
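
As a sketch of the two-list variant with a shared limit, in Python: this assumes the _access: {read: [], write: []} shape, that write implies read, and a total of 10 distinct members across both lists. `normalise_access`, `can_read` and `can_write` are hypothetical names, and the limit is of course still up for bikeshedding.

```python
# Hypothetical sketch: two-list _access where write implies read and the
# limit applies to distinct members across both lists.

MAX_ACCESS_ENTRIES = 10


def normalise_access(access):
    read = set(access.get("read", []))
    write = set(access.get("write", []))
    everyone = read | write  # write implies read
    if not everyone:
        raise ValueError("_access needs at least one member")
    if len(everyone) > MAX_ACCESS_ENTRIES:
        raise ValueError("too many _access members")
    return {"read": sorted(everyone), "write": sorted(write)}


def can_read(access, ids):
    return bool(set(ids) & set(access["read"]))


def can_write(access, ids):
    return bool(set(ids) & set(access["write"]))


acl = normalise_access({"read": ["user_b"], "write": ["user_a"]})
print(acl)  # {'read': ['user_a', 'user_b'], 'write': ['user_a']}
```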

* * * 

So far, I think all of the problems outlined are solvable, if with a clear definition of what use-cases we do not support with _access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.

Thanks for reading this far and I’m looking forward to your feedback.

Best,
Jan “_access” Lehnardt
—

> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
> 
> Hi Everyone,
> 
> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
> 
> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
> 
> You can check out my branch here:
> 
> https://github.com/apache/couchdb/compare/access?expand=1
> 
> It is very much work in progress, but it is far enough along to warrant discussion.
> 
> The main point of this branch is to show all the places that we would need to change to support the proposal.
> 
> Things I’ve left for later:
> 
> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
> - building per-access views has not been implemented yet, couch_index would have to be taught about the new per-access-id index.
> - pretty HTTP error handling
> - tests except for a tiny shell script 😇
> 
> Implementation notes:
> 
> You create a database with the _access feature turned on like so:  PUT /db?access=true
> 
> I started out with storing _access in the document body, as that would allow for a minimal change set. However, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive, so I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impacts multi-version cluster interop, and especially on whether this is an acceptable approach.
> 
> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
> 
> * * *
> 
> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes by-access-id and by-access-seq which do what you’d expect, pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
> 
> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
> 
> * * *
> 
> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
> 
> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
> 
> * * *
> 
> Open questions:
> 
> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and which would require all apps to be changed, is that docs in an _access enabled database have to include an _access field (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
> 
> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not use the username from the first question above and use that as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones); instead, all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
> 
> - building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes; we should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
> 
> * * *
> 
> I’ll leave this here for now, I’m sure there are a few more things to consider.
> 
> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
> 
> Best
> Jan
> —

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Re: [DISCUSS] Per-doc access control

Posted by Jan Lehnardt <ma...@jan.io>.

One addition, the slotting in of _access into existing security mechanisms is as follows:

1. Check if a user is in _security
2. If yes, check if the user is in _access (modulo read/write)
3. If yes, does the doc update pass any globally defined VDUs
4. If yes, operation can proceed.
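
The order of checks above, as a minimal Python sketch. All names are hypothetical; VDUs are modelled as Python callables that raise ValueError to reject an update (real VDUs are JavaScript functions), and admins and the read/write split are not special-cased here.

```python
# Hypothetical sketch of the check order for a doc update.

def in_security(user_ctx, security):
    name = user_ctx.get("name")
    roles = set(user_ctx.get("roles", []))
    for section in ("admins", "members"):
        entry = security.get(section, {})
        if name is not None and name in entry.get("names", []):
            return True
        if roles & set(entry.get("roles", [])):
            return True
    return False


def authorise_update(user_ctx, security, doc_access, new_doc, old_doc, global_vdus):
    # 1. Is the user in _security?
    if not in_security(user_ctx, security):
        return "unauthorized: not in _security"
    # 2. Is the user (or one of their roles) in the doc's _access?
    ids = set(user_ctx.get("roles", []))
    if user_ctx.get("name"):
        ids.add(user_ctx["name"])
    if not ids & set(doc_access):
        return "unauthorized: not in _access"
    # 3. Does the update pass all globally defined VDUs?
    for vdu in global_vdus:
        try:
            vdu(new_doc, old_doc, user_ctx)
        except ValueError as err:
            return f"forbidden: {err}"
    # 4. The operation can proceed.
    return "ok"
```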

Cheers
Jan
—

> On 10. Mar 2019, at 15:51, Jan Lehnardt <ja...@apache.org> wrote:
> 
> Hey all,
> 
> after mulling this over some more, I’d like to tackle the detailed API and behaviour for this. Especially how _access work in conjunction with existing access control features.
> 
> My guiding principles so far are:
> 
> 1. Make the API intuitive, things should work like they look like they should work like.
> 2. The default should never be that a resources is accidentally left accessible to the public.
> 3. This should work as a natural extension to the existing security features*.
> 
> * I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.
> 
> 
> ## Database Creation and Default Behaviours
> 
> Creating a database with _access features is, as mentioned before done via a flag to PUT /database?access=true
> 
> In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone by server or db admins. Not very useful.
> 
> To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.
> 
> In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
> 
> * * *
> 
> 
> ## Document Creation and Access Control
> 
> Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.
> 
> 1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.
> 
> 2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.
> 
> This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user can not create a new document without an _access field that is an array that has at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.
> 
> The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well. We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.
> 
> For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.
> 
> For both creations and updates, a user MUST supply at least one role they belong to or their own username.
> 
> * * *
> 
> 
> ## _revs_diff
> 
> /db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
> 
> This would allow users to specify ids and rev(s) for docs they don’t have access too (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know what to do with, if it encounters it (say a user got removed from the _access list inbetween the replicator opening _changes and requesting the doc).
> 
> The _revs_diff implementation would have to altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
> 
> An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.
> 
> * * *
> 
> 
> ## Partitioned Databases
> 
> I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
> 
> * * *
> 
> 
> ## Scenarios that _access should help with.
> 
> Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.
> 
> 
> ### Scenario 1: db-per-user
> 
> In this scenario, _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.
> 
> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.
> 
> This is the simplest scenario, as all we’d have to track the owner of a document and produce by-access-id/seq indexes based on that owner.
> 
> The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest do implement and explain.
> 
> Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.
> 
> 
> ### Scenario 2: db-per-user + Sharing
> 
> The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:
> 
> #### 2.1: The Todo List
> 
> In this scenario, a user has a reasonable amount of ”personal data” that they want to selectively share with one or more other users.
> 
> #### 2.2: The Chat/Forum/Newsgroup
> 
> In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.
> 
> #### 2.3: The Corporate Hierarchy
> 
> In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.
> 
> 
> ### 3: Multiple Apps
> 
> The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?
> 
> I’ll be referring to these scenarios down the line.
> 
> * * *
> 
> 
> ## Design Docs
> 
> ### Admin
> 
> One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).
> 
> N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
> 
> 
> ### User
> 
> In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?
> 
> The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).
> 
> The easiest to handle is a `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to if they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.
> 
> Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.
> 
> More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs, how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). But especially in scenario 3. this becomes problematic, but even without that specific scenario, this violates the no surprises best practice.
> 
> We could say:
> 
> a) we don’t support scenario 3.
> b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
> c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.
> 
> I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.
> 
> In practice this would mean, a document that has an _id that starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.
> 
> 
> ## Read and Write Access
> 
> Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.
> 
> Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).
> 
> However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think it would be reasonable for users to expect per-doc read/write permission granting.
> 
> So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []} or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.
> 
> The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
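To make the `_access: {read: [], write: []}` option concrete, here is a minimal sketch in Python. It is not taken from the prototype branch: the function names, the "write implies read" rule and the limit of 10 entries across both lists are assumptions chosen to match the options discussed above.

```python
MAX_ACCESS_ENTRIES = 10  # assumption: the "10 total" option from the discussion


def principals_of(user_ctx):
    """Return the user's name plus all of their roles as one set."""
    return {user_ctx["name"], *user_ctx.get("roles", [])}


def validate_access(access):
    """Reject _access objects that are empty or exceed the combined limit."""
    read = access.get("read", [])
    write = access.get("write", [])
    # An entry that appears in both lists counts only once.
    total = len(set(read) | set(write))
    if total == 0:
        raise ValueError("_access needs at least one entry")
    if total > MAX_ACCESS_ENTRIES:
        raise ValueError("_access allows at most %d entries" % MAX_ACCESS_ENTRIES)


def can_write(user_ctx, access):
    """True if the user or one of their roles is in the write list."""
    return bool(principals_of(user_ctx) & set(access.get("write", [])))


def can_read(user_ctx, access):
    """True if the user may read; write access implies read access here."""
    in_read = bool(principals_of(user_ctx) & set(access.get("read", [])))
    return in_read or can_write(user_ctx, access)
```

With `{"read": ["user_a"], "write": ["user_b"]}`, user_a can read but not write, and user_b can do both.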
> 
> 
> * * * 
> 
> 
> So far, I think all of the problems outlined are solvable, given a clear definition of which use-cases we do not support with _access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.
> 
> Thanks for reading this far and I’m looking forward to your feedback.
> 
> 
> Best,
> Jan “_access” Lehnardt
> —
> 
> 
> 
> 
>> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> Hi Everyone,
>> 
>> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
>> 
>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>> 
>> You can check out my branch here:
>> 
>> https://github.com/apache/couchdb/compare/access?expand=1
>> 
>> It is very much work in progress, but it is far enough along to warrant discussion.
>> 
>> The main point of this branch is to show all the places that we would need to change to support the proposal.
>> 
>> Things I’ve left for later:
>> 
>> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
>> - building per-access views has not been implemented yet, couch_index would have to be taught about the new per-access-id index.
>> - pretty HTTP error handling
>> - tests except for a tiny shell script 😇
>> 
>> Implementation notes:
>> 
>> You create a database with the _access feature turned on like so:  PUT /db?access=true
>> 
>> I started out with storing _access in the document body, as that would allow for a minimal change set. However, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive, so I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impacts multi-version cluster interop, and especially on whether this is an acceptable approach.
>> 
>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>> 
>> * * *
>> 
>> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes, by-access-id and by-access-seq. They do what you’d expect: pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
>> 
>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
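The matching rule described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not of couch_access_native_proc: it scans a list instead of reading an index, the function name is made up, and it checks every element of `_access`, whereas the prototype currently uses only the first one.

```python
def visible_doc_ids(docs, user_ctx):
    """Return the ids a user would see in an _all_docs-like listing."""
    roles = user_ctx.get("roles", [])
    if "_admin" in roles:
        # Admins keep using the regular, unfiltered listing.
        return sorted(doc["_id"] for doc in docs)
    principals = {user_ctx["name"], *roles}
    # Docs without _access match nobody, so they stay admin-only.
    return sorted(
        doc["_id"] for doc in docs if principals & set(doc.get("_access", []))
    )
```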
>> 
>> * * *
>> 
>> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
>> 
>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
>> 
>> * * *
>> 
>> Open questions:
>> 
>> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and which would require all apps to be changed, is that apps using an _access enabled database must include an _access field in their docs (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
>> 
>> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not use the username from the first question above and use that as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones); instead, all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
>> 
>> - Building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes. We should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
>> 
>> * * *
>> 
>> I’ll leave this here for now, I’m sure there are a few more things to consider.
>> 
>> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
>> 
>> Best
>> Jan
>> —
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>

Re: [DISCUSS] Per-doc access control

Posted by Jan Lehnardt <ja...@apache.org>.

Hi all,

I have a first RFC up here: https://github.com/apache/couchdb-documentation/pull/424

The goal of this is to have a concrete scope of changes written
down and have it be a minimal set to implement while being useful.

Specifically, this proposal defers to later:

* per-access views
* differentiation between read and write access for documents
* sharing individual documents between multiple users or groups

The feature design so far aims to allow the addition of the above
at a later point. Specifically, per-access views and read/write
differentiation should not be too hard to add, but doc sharing
might better be left for a FoundationDB future.

I’d like a thorough review and comments that specifically address
any holes in the proposal. In particular, I’m interested to hear
if there are any showstoppers in there that a common db-per-user
setup could not migrate over to this model.

Best
Jan
—


> On 4. Apr 2019, at 10:45, Jan Lehnardt <ja...@apache.org> wrote:
> 
> Thanks for your initial comments.
> 
>> On 3. Apr 2019, at 23:07, Adam Kocoloski <ko...@apache.org> wrote:
>> 
>> I’m also in favor of dropping Scenario 3.
>> 
>> One topic we may have discussed in the past but I wanted to close out here: in the relational database world it’s not uncommon to use materialized views as an access control mechanism to selectively expose contents of a table to clients who cannot access the table directly. Does the current thinking on _access for views support that use case? Can we build a view using a set of roles inherited from the user who created the design doc, but then turn around and set the _access on the view itself to a less-restrictive set?
> 
> 3 minutes thinking it over didn’t reveal any particular problems with this feature, aside from include_docs not working as expected, which might be an okay trade-off for now. But could be included later.
> 
> 
>> On the _revs_diff topic — I’m not all that concerned about users trying to guess revision IDs that exist on the server, and then reverse-engineer the contents of the existing revisions. Maybe I ought to be.
> 
> I’m not particularly worried, but it is at least a theoretical situation where our users can be caught with their pants down when they didn’t expect it. All I want to make sure is that we document this properly. C.f. git where if you get access to a repo, you get the whole history, not just the state from where you started having access.
> 
> Best
> Jan
> —
> 
>> 
>> On a somewhat-related note, I have had conversations before with folks who are keen to adopt these sorts of fine-grained access control systems who said they actually prefer to have a 403 Forbidden response list the set of privileges that would be sufficient to access the resource. I found this surprising, but I guess it comes down to a user needing to figure out what kind of security exception to apply for in order to make progress with some data analysis. I think this is a topic on which we could make a fairly late-binding decision — or even have it as a configurable option.
>> 
>> I could definitely see the base Scenario 1 (single _access labels) landing ahead of the more-complex sharing models.
>> 
>> I haven’t had a chance to take a deep look at the code but the design seems good and thoughtful, and I definitely like the focus on the use cases.
>> 
>> Adam
>> 
>>> On Mar 14, 2019, at 11:21 AM, Jan Lehnardt <ja...@apache.org> wrote:
>>> 
>>> My replies now inline.
>>> 
>>>> On 14. Mar 2019, at 16:13, Jan Lehnardt <ja...@apache.org> wrote:
>>>> 
>>>> I received some notes privately from Gregor Martynus, which I’m reproducing here in email thread form. This email is all Gregor’s notes, my next email is my replies to them.
>>>> 
>>>>> On 10. Mar 2019, at 15:51, Jan Lehnardt <ja...@apache.org> wrote:
>>>>> 
>>>>> Hey all,
>>>>> 
>>>>> after mulling this over some more, I’d like to tackle the detailed API and behaviour for this. Especially how _access works in conjunction with existing access control features.
>>>>> 
>>>>> My guiding principles so far are:
>>>>> 
>>>>> 1. Make the API intuitive, things should work the way they look like they should work.
>>>>> 2. The default should never be that a resource is accidentally left accessible to the public.
>>>>> 3. This should work as a natural extension to the existing security features*.
>>>>> 
>>>>> * I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.
>>>>> 
>>>>> 
>>>>> ## Database Creation and Default Behaviours
>>>>> 
>>>>> Creating a database with _access features is, as mentioned before, done via a flag to PUT /database?access=true
>>>>> 
>>>>> In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone but server or db admins. Not very useful.
>>>>> 
>>>>> To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.
>>>>> 
>>>>> In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> ## Document Creation and Access Control
>>>>> 
>>>>> Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.
>>>>> 
>>>>> 1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.
>>>>> 
>>>>> 2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.
>>>> 
>>>> I prefer being explicit.
>>>> 
>>>> 
>>>>> 
>>>>> This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user can not create a new document without an _access field that is an array that has at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.
>>>>> 
>>>>> The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well.
>>>> 
>>>> We could also provide tooling for migrations?
>>> 
>>> I’d love tooling, but we’d have to make sure we can do it correctly for a big number of use-cases. For the acceptance of this change, I’d make “documenting a migration path for db-per-user setups” a MUST have, and any code that helps with that a nice to have.
>>> 
>>>> 
>>>> 
>>>>> We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.
>>>>> 
>>>>> For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.
>>>>> 
>>>>> For both creations and updates, a user MUST supply at least one role they belong to or their own username.
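The creation rules in this section can be written down as a small check. This is a minimal sketch in Python under stated assumptions: the function name and error messages are made up, the proposed `_public` role is not handled, and the admin special case (no `_access` field means `["_admin"]`) is the optional behaviour discussed above, not a settled decision.

```python
def check_new_doc(doc, user_ctx):
    """Validate _access on a doc creation; return the doc as it would be stored."""
    roles = user_ctx.get("roles", [])
    is_admin = "_admin" in roles
    access = doc.get("_access")
    if access is None and is_admin:
        # Special case: admins may omit _access, the doc becomes admin-only.
        return dict(doc, _access=["_admin"])
    if not isinstance(access, list) or not access:
        raise PermissionError("_access must be an array with at least one member")
    principals = {user_ctx["name"], *roles}
    if not is_admin and not (principals & set(access)):
        raise PermissionError("_access must name the user or one of their roles")
    return doc
```

For example, user_a creating `{"a": 1, "_access": ["user_a"]}` passes, while the same doc without `_access` is rejected.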
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> ## _revs_diff
>>>>> 
>>>>> /db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>>>> 
>>>>> This would allow users to specify ids and rev(s) for docs they don’t have access to (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know what to do with if it encounters it (say a user got removed from the _access list in between the replicator opening _changes and requesting the doc).
>>>>> 
>>>>> The _revs_diff implementation would have to be altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
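A possible shape for the expanded result, sketched in Python over in-memory data. The `{"error": "unauthorized"}` entry is purely hypothetical (the thread only says "id: unauthorized or somesuch"), as are the function name and the `stored` structure; only the `missing` key follows the existing _revs_diff response.

```python
def revs_diff(request, stored, user_ctx):
    """Answer a _revs_diff-style request with per-doc access checks.

    request: {doc_id: [rev, ...]}
    stored:  {doc_id: {"access": [...], "revs": set_of_known_revs}}
    """
    roles = user_ctx.get("roles", [])
    is_admin = "_admin" in roles
    principals = {user_ctx["name"], *roles}
    result = {}
    for doc_id, revs in request.items():
        doc = stored.get(doc_id)
        if doc is None:
            # Unknown doc: every requested revision is missing.
            result[doc_id] = {"missing": list(revs)}
        elif not is_admin and not (principals & set(doc["access"])):
            # Hypothetical token telling the replicator to skip this doc.
            result[doc_id] = {"error": "unauthorized"}
        else:
            missing = [rev for rev in revs if rev not in doc["revs"]]
            if missing:
                result[doc_id] = {"missing": missing}
    return result
```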
>>>>> 
>>>>> An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> ## Partitioned Databases
>>>>> 
>>>>> I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
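As an illustration of the opt-in described above: partitioned databases take the part of the doc id before the first colon as the partition key, so an application could build ids from the username. A minimal Python sketch with made-up helper names follows; it does not check the full set of rules for valid partition names.

```python
def partitioned_doc_id(username, local_id):
    """Build a doc id whose partition key is the username."""
    if not username or ":" in username:
        # A colon in the name would change where the partition key ends.
        raise ValueError("username must be non-empty and contain no ':'")
    return "%s:%s" % (username, local_id)


def partition_of(doc_id):
    """Return the partition key, i.e. everything before the first colon."""
    return doc_id.split(":", 1)[0]
```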
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> ## Scenarios that _access should help with.
>>>>> 
>>>>> Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.
>>>>> 
>>>>> 
>>>>> ### Scenario 1: db-per-user
>>>>> 
>>>>> In this scenario, before _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.
>>>>> 
>>>>> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.
>>>>> 
>>>>> This is the simplest scenario, as all we’d have to do is track the owner of a document and produce by-access-id/seq indexes based on that owner.
>>>>> 
>>>>> The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest to implement and explain.
>>>>> 
>>>>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.
>>>>> 
>>>>> 
>>>>> ### Scenario 2: db-per-user + Sharing
>>>> 
>>>> One scenario we should address is how un-sharing would work when documents are continuously replicated, e.g. to a client for offline usage. My understanding is that the person whose access to documents got revoked does not get a _changes update telling them that their access was removed, so it would be up to the application developer to implement some kind of "notification" meta documents. Unless you have a better idea?
>>> 
>>> Since we now have a purge API as well, we could treat an un-share as a purge for clients, and they can decide what to do with it.
>>> 
>>> Alternatively, we need to make breaking changes to _changes feed, maybe we can hide that behind an opt-in flag, like “/db/_changes?access=true”, and then we can send new rows like:
>>> 
>>> {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
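On the client side, handling such rows could look roughly like this Python sketch. The row shape (`id`, `rev`, `_revoked`) is taken from the example row above and is only a proposal; the function name and the choice to drop the local copy (in the spirit of treating an un-share as a purge) are assumptions.

```python
def apply_changes(local_docs, rows):
    """Apply _changes rows to a local {doc_id: rev} map and return it."""
    for row in rows:
        if row.get("_revoked"):
            # Access was withdrawn: forget the local copy.
            local_docs.pop(row["id"], None)
        else:
            local_docs[row["id"]] = row["rev"]
    return local_docs
```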
>>> 
>>> 
>>>> 
>>>>> 
>>>>> The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:
>>>>> 
>>>>> #### 2.1: The Todo List
>>>>> 
>>>>> In this scenario, a user has a reasonable amount of ”personal data” that they want to selectively share with one or more other users.
>>>>> 
>>>>> #### 2.2: The Chat/Forum/Newsgroup
>>>>> 
>>>>> In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.
>>>>> 
>>>>> #### 2.3: The Corporate Hierarchy
>>>>> 
>>>>> In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.
>>>>> 
>>>>> 
>>>>> ### 3: Multiple Apps
>>>>> 
>>>>> The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?
>>>>> 
>>>>> I’ll be referring to these scenarios down the line.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> 
>>>>> ## Design Docs
>>>>> 
>>>>> ### Admin
>>>>> 
>>>>> One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).
>>>>> 
>>>>> N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
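The change to the index signature can be sketched as follows, in Python. This is an illustration of the idea only: CouchDB computes its view signatures differently, and the function name, the JSON encoding and the choice of SHA-256 are assumptions.

```python
import hashlib
import json


def index_signature(views, user_ctx=None):
    """Hash the `views` member, optionally together with a user's name and roles."""
    payload = {"views": views}
    if user_ctx is not None:
        payload["name"] = user_ctx["name"]
        # Sort the roles so their order does not change the signature.
        payload["roles"] = sorted(user_ctx.get("roles", []))
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

With this scheme, a user leaving or gaining a group yields a different signature, which is exactly why all of that user's indexes would need rebuilding.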
>>>>> 
>>>>> 
>>>>> ### User
>>>>> 
>>>>> In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?
>>>>> 
>>>>> The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>>>> 
>>>>> The easiest to handle is `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to when they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.
>>>>> 
>>>>> Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.
>>>>> 
>>>>> More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs,
>>>> 
>>>> I can’t think of a db-per-user scenario where each user DB would have a different validate_doc_update method? It would be the same method with access to the user context, the DBs security setting and the document, so it would act differently for different users, but using the same code.
>>> 
>>> They wouldn’t be different, but if we were to replicate 1000 db-per-user design docs into a single database, as per today’s semantics, we’d have to run 1000 VDUs on each doc update.
>>> 
>>>> 
>>>>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). This becomes problematic especially in scenario 3., but even without that specific scenario, it violates the no-surprises best practice.
>>>>> 
>>>>> We could say:
>>>>> 
>>>>> a) we don’t support scenario 3.
>>>> 
>>>> +1, I think it would make our lives easier in general if we don’t recommend to share the same CouchDB for multiple apps. At least I don’t see a reason to do that at this point.
>>> 
>>> I think I like this best, too, but I’d like to hear from others as well.
>>> 
>>> 
>>> Best
>>> Jan
>>> —
>>>> 
>>>>> b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
>>>>> c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.
>>>>> 
>>>>> I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.
>>>>> 
>>>>> In practice this would mean, a document that has an _id that starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.
>>>>> 
>>>>> 
>>>>> ## Read and Write Access
>>>>> 
>>>>> Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.
>>>>> 
>>>>> Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).
>>>>> 
>>>>> However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think it would be reasonable for users to expect per-doc read/write permission granting.
>>>> 
>>>> Yes!
>>>> 
>>>>> 
>>>>> So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []}
>>>> 
>>>> I prefer this API for its compactness, thinking about offline synchronization. The smaller the docs, the better.
>>>> 
>>>> Best
>>>> “Gregor”
>>>> —
>>>> 
>>>> 
>>>>> or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.
>>>>> 
>>>>> The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
>>>>> 
>>>>> 
>>>>> * * * 
>>>>> 
>>>>> 
>>>>> So far, I think all of the problems outlined are solvable, given a clear definition of which use-cases we do not support with _access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.
>>>>> 
>>>>> Thanks for reading this far and I’m looking forward to your feedback.
>>>>> 
>>>>> 
>>>>> Best,
>>>>> Jan “_access” Lehnardt
>>>>> —
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
>>>>>> 
>>>>>> Hi Everyone,
>>>>>> 
>>>>>> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
>>>>>> 
>>>>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E <https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E>
>>>>>> 
>>>>>> You can check out my branch here:
>>>>>> 
>>>>>> https://github.com/apache/couchdb/compare/access?expand=1 <https://github.com/apache/couchdb/compare/access?expand=1>
>>>>>> 
>>>>>> It is very much work in progress, but it is far enough along to warrant discussion.
>>>>>> 
>>>>>> The main point of this branch is to show all the places that we would need to change to support the proposal.
>>>>>> 
>>>>>> Things I’ve left for later:
>>>>>> 
>>>>>> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
>>>>>> - building per-access views has not been implemented yet, couch_index would have to be taught about the new per-access-id index.
>>>>>> - pretty HTTP error handling
>>>>>> - tests except for a tiny shell script 😇
>>>>>> 
>>>>>> Implementation notes:
>>>>>> 
>>>>>> You create a database with the _access feature turned on like so:  PUT /db?access=true
>>>>>> 
>>>>>> I started out with storing _access in the document body, as that would allow for a minimal change set, however, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive, so I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impact multi-version cluster interop. And especially whether this is an acceptable approach.
>>>>>> 
>>>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>>>> 
>>>>>> * * *
>>>>>> 
>>>>>> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes by-access-id and by-access-seq which do what you’d expect, pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
>>>>>> 
>>>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>>>> 
>>>>>> * * *
>>>>>> 
>>>>>> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
>>>>>> 
>>>>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 <https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497> (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
>>>>>> 
>>>>>> * * *
>>>>>> 
>>>>>> Open questions:
>>>>>> 
>>>>>> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and which would require all apps to be changed, is that docs in an _access enabled database need to include an _access field (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
>>>>>> 
>>>>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not take the username from the first question above and use that as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones); instead all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
>>>>>> 
>>>>>> - Building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes; we should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
>>>>>> 
>>>>>> * * *
>>>>>> 
>>>>>> I’ll leave this here for now, I’m sure there are a few more things to consider.
>>>>>> 
>>>>>> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
>>>>>> 
>>>>>> Best
>>>>>> Jan
>>>>>> —

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Re: [DISCUSS] Per-doc access control

Posted by Jan Lehnardt <ja...@apache.org>.

Thanks for your initial comments.

> On 3. Apr 2019, at 23:07, Adam Kocoloski <ko...@apache.org> wrote:
> 
> I’m also in favor of dropping Scenario 3.
> 
> One topic we may have discussed in the past but I wanted to close out here: in the relational database world it’s not uncommon to use materialized views as an access control mechanism to selectively expose contents of a table to clients who cannot access the table directly. Does the current thinking on _access for views support that use case? Can we build a view using a set of roles inherited from the user who created the design doc, but then turn around and set the _access on the view itself to a less-restrictive set?

Three minutes of thinking it over didn’t reveal any particular problems with this feature, aside from include_docs not working as expected, which might be an okay trade-off for now and could be added later.
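
As a rough illustration only (not what the branch does), the two role sets could be kept apart like this: one set decides which docs end up in the index, the other decides who may query it. The function names `doc_visible_to`, `build_view` and `can_query_view` are made up for this sketch, and treating a doc without _access as admin-only is an assumption taken from the discussion so far.

```python
def doc_visible_to(doc, name, roles):
    """True if the doc's _access list names the user or one of their roles."""
    # Assumption: a doc without _access is treated as admin-only.
    access = doc.get("_access", ["_admin"])
    return name in access or any(role in access for role in roles)


def build_view(docs, creator_name, creator_roles, map_fun):
    """Index only the docs the design doc's creator is allowed to see."""
    rows = []
    for doc in docs:
        if doc_visible_to(doc, creator_name, creator_roles):
            rows.extend(map_fun(doc))
    return rows


def can_query_view(view_access, name, roles):
    """The view carries its own, possibly less restrictive, access list."""
    return name in view_access or any(role in view_access for role in roles)


docs = [
    {"_id": "a", "_access": ["analysts"], "amount": 10},
    {"_id": "b", "_access": ["analysts"], "amount": 5},
    {"_id": "c", "_access": ["user_x"], "amount": 99},
]

# alice (role: analysts) created the design doc, so the index covers a and b.
rows = build_view(docs, "alice", ["analysts"], lambda d: [(d["_id"], d["amount"])])

# The view itself is opened up to a wider audience than the docs are.
view_access = ["analysts", "interns"]

print(rows)
print(can_query_view(view_access, "bob", ["interns"]))
print(doc_visible_to(docs[0], "bob", ["interns"]))
```

In this sketch bob may read the view rows but still fails the per-doc check, which is the reason include_docs would not behave as expected for him.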


> On the _revs_diff topic — I’m not all that concerned about users trying to guess revision IDs that exist on the server, and then reverse-engineer the contents of the existing revisions. Maybe I ought to be.

I’m not particularly worried, but it is at least a theoretical situation where our users can be caught with their pants down when they didn’t expect it. All I want to make sure is that we document this properly. Cf. git, where if you get access to a repo, you get the whole history, not just the state from when you started having access.
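
To show the kind of guessing this is about, here is a toy sketch. It assumes a rev hash that is simply MD5 over canonical JSON; the real revision hash in CouchDB is computed differently, so `toy_rev_hash` is a stand-in and the doc fields are invented. The point is only that a content-derived id can confirm a guess when the set of plausible values is small.

```python
import hashlib
import json


def toy_rev_hash(body):
    """Stand-in for a content-derived rev id: MD5 over canonical JSON."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def guess_old_value(known_hash, template, field, candidates):
    """Try each candidate for one unknown field until the hash matches."""
    for candidate in candidates:
        attempt = dict(template, **{field: candidate})
        if toy_rev_hash(attempt) == known_hash:
            return candidate
    return None


# An earlier revision the user never had access to.
old_body = {"type": "review", "employee": "user_a", "rating": 2}
leaked_hash = toy_rev_hash(old_body)

# The user knows the doc's shape from the current revision and tries ratings 1-5.
template = {"type": "review", "employee": "user_a"}
print(guess_old_value(leaked_hash, template, "rating", range(1, 6)))
```

With a large or unpredictable value space the loop never terminates in practice, which is why this stays a theoretical concern that mostly needs documenting.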

Best
Jan
—

> 
> On a somewhat-related note, I have had conversations before with folks who are keen to adopt these sorts of fine-grained access control systems who said they actually prefer to have a 403 Forbidden response list the set of privileges that would be sufficient to access the resource. I found this surprising, but I guess it comes down to a user needing to figure out what kind of security exception to apply for in order to make progress with some data analysis. I think this is a topic on which we could make a fairly late-binding decision — or even have it as a configurable option.
> 
> I could definitely see the base Scenario 1 (single _access labels) landing ahead of the more-complex sharing models.
> 
> I haven’t had a chance to take a deep look at the code but the design seems good and thoughtful, and I definitely like the focus on the use cases.
> 
> Adam
> 
>> On Mar 14, 2019, at 11:21 AM, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> My replies now inline.
>> 
>>> On 14. Mar 2019, at 16:13, Jan Lehnardt <ja...@apache.org> wrote:
>>> 
>>> I received some notes privately from Gregor Martynus, which I’m reproducing here in email thread form. This email is all Gregor’s notes, my next email is my replies to them.
>>> 
>>>> On 10. Mar 2019, at 15:51, Jan Lehnardt <ja...@apache.org> wrote:
>>>> 
>>>> Hey all,
>>>> 
>>>> after mulling this over some more, I’d like to tackle the detailed API and behaviour for this. Especially how _access works in conjunction with existing access control features.
>>>> 
>>>> My guiding principles so far are:
>>>> 
>>>> 1. Make the API intuitive, things should work like they look like they should work like.
>>>> 2. The default should never be that a resource is accidentally left accessible to the public.
>>>> 3. This should work as a natural extension to the existing security features*.
>>>> 
>>>> * I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.
>>>> 
>>>> 
>>>> ## Database Creation and Default Behaviours
>>>> 
>>>> Creating a database with _access features is, as mentioned before, done via a flag: PUT /database?access=true
>>>> 
>>>> In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone but server or db admins. Not very useful.
>>>> 
>>>> To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.
>>>> 
>>>> In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
>>>> 
>>>> * * *
>>>> 
>>>> 
>>>> ## Document Creation and Access Control
>>>> 
>>>> Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.
>>>> 
>>>> 1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.
>>>> 
>>>> 2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.
>>> 
>>> I prefer being explicit.
>>> 
>>> 
>>>> 
>>>> This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user can not create a new document without an _access field that is an array that has at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.
>>>> 
>>>> The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well.
>>> 
>>> We could also provide tooling for migrations?
>> 
>> I’d love tooling, but we’d have to make sure we can do it correctly for a big number of use-cases. For the acceptance of this change, I’d make “documenting a migration path for db-per-user setups” a MUST have, and any code that helps with that a nice to have.
>> 
>>> 
>>> 
>>>> We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.
>>>> 
>>>> For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.
>>>> 
>>>> For both creations and updates, a user MUST supply at least one role they belong to or their own username.
>>>> 
>>>> * * *
>>>> 
>>>> 
>>>> ## _revs_diff
>>>> 
>>>> /db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>>> 
>>>> This would allow users to specify ids and rev(s) for docs they don’t have access to (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know what to do with if it encounters it (say a user got removed from the _access list in between the replicator opening _changes and requesting the doc).
>>>> 
>>>> The _revs_diff implementation would have to be altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
>>>> 
>>>> An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.
>>>> 
>>>> * * *
>>>> 
>>>> 
>>>> ## Partitioned Databases
>>>> 
>>>> I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
>>>> 
>>>> * * *
>>>> 
>>>> 
>>>> ## Scenarios that _access should help with.
>>>> 
>>>> Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.
>>>> 
>>>> 
>>>> ### Scenario 1: db-per-user
>>>> 
>>>> In this scenario, before _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.
>>>> 
>>>> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.
>>>> 
>>>> This is the simplest scenario, as all we’d have to do is track the owner of a document and produce by-access-id/seq indexes based on that owner.
>>>> 
>>>> The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest to implement and explain.
>>>> 
>>>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.
>>>> 
>>>> 
>>>> ### Scenario 2: db-per-user + Sharing
>>> 
>>> One scenario we should address is how stopping to share would work when documents are continuously replicated, e.g. to a client for offline usage. My understanding is that the person whose access to documents got revoked does not get a _changes update telling them that their access got removed; it would be up to the application developer to implement some kind of "notification" meta documents. Unless you have a better idea?
>> 
>> Since we now have a purge API as well, we could treat an un-share as a purge for clients, and they can decide what to do with it.
>> 
>> Alternatively, we need to make breaking changes to _changes feed, maybe we can hide that behind an opt-in flag, like “/db/_changes?access=true”, and then we can send new rows like:
>> 
>> {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
>> 
>> 
>>> 
>>>> 
>>>> The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:
>>>> 
>>>> #### 2.1: The Todo List
>>>> 
>>>> In this scenario, a user has a reasonable amount of ”personal data” that they want to selectively share with one or more other users.
>>>> 
>>>> #### 2.2: The Chat/Forum/Newsgroup
>>>> 
>>>> In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.
>>>> 
>>>> #### 2.3: The Corporate Hierarchy
>>>> 
>>>> In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.
>>>> 
>>>> 
>>>> ### 3: Multiple Apps
>>>> 
>>>> The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?
>>>> 
>>>> I’ll be referring to these scenarios down the line.
>>>> 
>>>> * * *
>>>> 
>>>> 
>>>> ## Design Docs
>>>> 
>>>> ### Admin
>>>> 
>>>> One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).
>>>> 
>>>> N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
>>>> 
>>>> 
>>>> ### User
>>>> 
>>>> In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?
>>>> 
>>>> The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>>> 
>>>> The easiest to handle is a `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to if they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.
>>>> 
>>>> Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.
>>>> 
>>>> More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs,
>>> 
>>> I can’t think of a db-per-user scenario where each user DB would have a different validate_doc_update method? It would be the same method with access to the user context, the DBs security setting and the document, so it would act differently for different users, but using the same code.
>> 
>> They wouldn’t be different, but if we were to replicate 1000 db-per-user design docs into a single database, as per today’s semantics, we’d have to run 1000 VDUs on each doc update.
>> 
>>> 
>>>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). But especially in scenario 3. this becomes problematic, but even without that specific scenario, this violates the no surprises best practice.
>>>> 
>>>> We could say:
>>>> 
>>>> a) we don’t support scenario 3.
>>> 
>>> +1, I think it would make our lives easier in general if we don’t recommend to share the same CouchDB for multiple apps. At least I don’t see a reason to do that at this point.
>> 
>> I think I like this best, too, but I’d like to hear from others as well.
>> 
>> 
>> Best
>> Jan
>> —
>>> 
>>>> b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
>>>> c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.
>>>> 
>>>> I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.
>>>> 
>>>> In practice this would mean, a document that has an _id that starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.
>>>> 
>>>> 
>>>> ## Read and Write Access
>>>> 
>>>> Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.
>>>> 
>>>> Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).
>>>> 
>>>> However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think it would be a reasonable expectation for users to expect a per-doc read/write permission granting.
>>> 
>>> Yes!
>>> 
>>>> 
>>>> So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []}
>>> 
>>> I prefer this API for its compactness, thinking about offline synchronization. The smaller the docs, the better.
>>> 
>>> Best
>>> “Gregor”
>>> —
>>> 
>>> 
>>>> or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.
>>>> 
>>>> The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
>>>> 
>>>> 
>>>> * * * 
>>>> 
>>>> 
>>>> So far, I think all of the problems outlined are solvable, if only with a clear definition of what use-cases we do not support with access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.
>>>> 
>>>> Thanks for reading this far and I’m looking forward to your feedback.
>>>> 
>>>> 
>>>> Best,
>>>> Jan “_access” Lehnardt
>>>> —
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
>>>>> 
>>>>> Hi Everyone,
>>>>> 
>>>>> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
>>>>> 
>>>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>>>> 
>>>>> You can check out my branch here:
>>>>> 
>>>>> https://github.com/apache/couchdb/compare/access?expand=1
>>>>> 
>>>>> It is very much work in progress, but it is far enough along to warrant discussion.
>>>>> 
>>>>> The main point of this branch is to show all the places that we would need to change to support the proposal.
>>>>> 
>>>>> Things I’ve left for later:
>>>>> 
>>>>> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
>>>>> - building per-access views has not been implemented yet, couch_index would have to be taught about the new per-access-id index.
>>>>> - pretty HTTP error handling
>>>>> - tests except for a tiny shell script 😇
>>>>> 
>>>>> Implementation notes:
>>>>> 
>>>>> You create a database with the _access feature turned on like so:  PUT /db?access=true
>>>>> 
>>>>> I started out with storing _access in the document body, as that would allow for a minimal change set. However, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive, so I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impacts multi-version cluster interop, and especially on whether this is an acceptable approach.
>>>>> 
>>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>>> 
>>>>> * * *
>>>>> 
>>>>> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes by-access-id and by-access-seq which do what you’d expect, pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
>>>>> 
>>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>>> 
>>>>> * * *
>>>>> 
>>>>> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
>>>>> 
>>>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
>>>>> 
>>>>> * * *
>>>>> 
>>>>> Open questions:
>>>>> 
>>>>> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and which would require all apps to be changed, is that docs in an _access enabled database need to include an _access field (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
>>>>> 
>>>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not take the username from the first question above and use that as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones); instead all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
>>>>> 
>>>>> - Building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes; we should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
>>>>> 
>>>>> * * *
>>>>> 
>>>>> I’ll leave this here for now, I’m sure there are a few more things to consider.
>>>>> 
>>>>> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
>>>>> 
>>>>> Best
>>>>> Jan
>>>>> —

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Re: [DISCUSS] Per-doc access control

Posted by Jan Lehnardt <ja...@apache.org>.


> On 4. Apr 2019, at 00:25, Robert Samuel Newson <rn...@apache.org> wrote:
> 
> Hi,
> 
> It sounds like we require a separate flag as to whether any given username or role may appear in such an informative 403 message. That is, wherever we say that X is granted access, we have an optional flag on X to say if we can disclose its right of access to others, defaulting to false if not present.

If we make this a per-db setting, it shouldn’t be hard to add. If folks need to have this work on a per-doc basis, I’m happy to argue that that should be done outside of CouchDB :)

Best
Jan
—
> 
> B.
> 
>> On 3 Apr 2019, at 23:21, Adam Kocoloski <ko...@apache.org> wrote:
>> 
>> Totally agree it’s information leakage - that’s why I found it surprising that this was their desired mode of operation. It works when there’s a relatively small set of labels that get applied to data, and the labels themselves are not all that confidential.
>> 
>> Adam
>> 
>>> On Apr 3, 2019, at 5:53 PM, Joan Touzet <wo...@apache.org> wrote:
>>> 
>>> One parenthetical...
>>> 
>>>> From: "Adam Kocoloski" <ko...@apache.org>
>>>> 
>>>> On a somewhat-related note, I have had conversations before with
>>>> folks who are keen to adopt these sorts of fine-grained access
>>>> control systems who said they actually prefer to have a 403
>>>> Forbidden response list the set of privileges that would be
>>>> sufficient to access the resource. I found this surprising, but I
>>>> guess it comes down to a user needing to figure out what kind of
>>>> security exception to apply for in order to make progress with some
>>>> data analysis. I think this is a topic on which we could make a
>>>> fairly late-binding decision — or even have it as a configurable
>>>> option.
>>> 
>>> Anyone who's ever had to deal with Amazon's AWS IAM configuration
>>> certainly can appreciate this need. I'm +1 on the idea, assuming it's
>>> not hard to implement...but...
>>> 
>>> The problem is that it can be a data leak. In Jan's initial gist, he
>>> shows the _access field being populated by usernames only (Scenario 1).
>>> The only possible exception here is to get your username added to the
>>> _access field on that document.
>>> 
>>> If we do this via roles, then you could be leaking role name
>>> definitions via this response. Not sure we care, but having a full
>>> list of roles that could possibly provide that permission is certainly
>>> a hole.
>>> 
>>> If we do this via _capability_, then you're looking at a set of
>>> permissions such as reader, writer, deleter, and that specific
>>> permission could be returned:
>>> 
>>> {"needed":"writer", "obtained":"reader"}
>>> 
>>> That'd work, but it's different from what Jan has proposed to date, I
>>> believe, especially in distinguishing between read, write, and delete.
>>> 
>>> -Joan
>> 
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Re: [DISCUSS] Per-doc access control

Posted by Robert Samuel Newson <rn...@apache.org>.

Hi,

It sounds like we require a separate flag as to whether any given username or role may appear in such an informative 403 message. That is, wherever we say that X is granted access, we have an optional flag on X to say if we can disclose its right of access to others, defaulting to false if not present.
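
A minimal sketch of that idea, with everything in it hypothetical: the `disclosable` mapping stands in for wherever such a flag would be stored, and the `sufficient` field in the response body is an invented name, not an existing or proposed CouchDB response format.

```python
def forbidden_response(doc_access, disclosable):
    """Build a 403 body, listing only principals flagged as disclosable."""
    body = {"error": "forbidden", "reason": "You are not allowed to access this doc."}
    # Missing flag means False: nothing is disclosed unless explicitly allowed.
    hints = [p for p in doc_access if disclosable.get(p, False)]
    if hints:
        body["sufficient"] = hints
    return 403, body


# Hypothetical flags: only the "analysts" role may be named to outsiders.
disclosable = {"analysts": True, "hr_investigations": False}

status, body = forbidden_response(
    ["analysts", "hr_investigations", "user_a"], disclosable
)
print(status, body)
```

Here `user_a` has no flag at all and `hr_investigations` is flagged false, so only `analysts` is listed; a doc whose principals are all undisclosable yields a plain 403 with no hint.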

B.

> On 3 Apr 2019, at 23:21, Adam Kocoloski <ko...@apache.org> wrote:
> 
> Totally agree it’s information leakage - that’s why I found it surprising that this was their desired mode of operation. It works when there’s a relatively small set of labels that get applied to data, and the labels themselves are not all that confidential.
> 
> Adam
> 
>> On Apr 3, 2019, at 5:53 PM, Joan Touzet <wo...@apache.org> wrote:
>> 
>> One parenthetical...
>> 
>>> From: "Adam Kocoloski" <ko...@apache.org>
>>> 
>>> On a somewhat-related note, I have had conversations before with
>>> folks who are keen to adopt these sorts of fine-grained access
>>> control systems who said they actually prefer to have a 403
>>> Forbidden response list the set of privileges that would be
>>> sufficient to access the resource. I found this surprising, but I
>>> guess it comes down to a user needing to figure out what kind of
>>> security exception to apply for in order to make progress with some
>>> data analysis. I think this is a topic on which we could make a
>>> fairly late-binding decision — or even have it as a configurable
>>> option.
>> 
>> Anyone who's ever had to deal with Amazon's AWS IAM configuration
>> certainly can appreciate this need. I'm +1 on the idea, assuming it's
>> not hard to implement...but...
>> 
>> The problem is that it can be a data leak. In Jan's initial gist, he
>> shows the _access field being populated by usernames only (Scenario 1).
>> The only possible exception here is to get your username added to the
>> _access field on that document.
>> 
>> If we do this via roles, then you could be leaking role name
>> definitions via this response. Not sure we care, but having a full
>> list of roles that could possibly provide that permission is certainly
>> a hole.
>> 
>> If we do this via _capability_, then you're looking at a set of
>> permissions such as reader, writer, deleter, and that specific
>> permission could be returned:
>> 
>> {"needed":"writer", "obtained":"reader"}
>> 
>> That'd work, but it's different from what Jan has proposed to date, I
>> believe, especially in distinguishing between read, write, and delete.
>> 
>> -Joan
>

Re: [DISCUSS] Per-doc access control

Posted by Adam Kocoloski <ko...@apache.org>.

Totally agree it’s information leakage - that’s why I found it surprising that this was their desired mode of operation. It works when there’s a relatively small set of labels that get applied to data, and the labels themselves are not all that confidential.

Adam

> On Apr 3, 2019, at 5:53 PM, Joan Touzet <wo...@apache.org> wrote:
> 
> One parenthetical...
> 
>> From: "Adam Kocoloski" <ko...@apache.org>
>> 
>> On a somewhat-related note, I have had conversations before with
>> folks who are keen to adopt these sorts of fine-grained access
>> control systems who said they actually prefer to have a 403
>> Forbidden response list the set of privileges that would be
>> sufficient to access the resource. I found this surprising, but I
>> guess it comes down to a user needing to figure out what kind of
>> security exception to apply for in order to make progress with some
>> data analysis. I think this is a topic on which we could make a
>> fairly late-binding decision — or even have it as a configurable
>> option.
> 
> Anyone who's ever had to deal with Amazon's AWS IAM configuration
> certainly can appreciate this need. I'm +1 on the idea, assuming it's
> not hard to implement...but...
> 
> The problem is that it can be a data leak. In Jan's initial gist, he
> shows the _access field being populated by usernames only (Scenario 1).
> The only possible exception here is to get your username added to the
> _access field on that document.
> 
> If we do this via roles, then you could be leaking role name
> definitions via this response. Not sure we care, but having a full
> list of roles that could possibly provide that permission is certainly
> a hole.
> 
> If we do this via _capability_, then you're looking at a set of
> permissions such as reader, writer, deleter, and that specific
> permission could be returned:
> 
>  {"needed":"writer", "obtained":"reader"}
> 
> That'd work, but it's different from what Jan has proposed to date, I
> believe, especially in distinguishing between read, write, and delete.
> 
> -Joan

Re: [DISCUSS] Per-doc access control

Posted by Joan Touzet <wo...@apache.org>.

One parenthetical...

> From: "Adam Kocoloski" <ko...@apache.org>
> 
> On a somewhat-related note, I have had conversations before with
> folks who are keen to adopt these sorts of fine-grained access
> control systems who said they actually prefer to have a 403
> Forbidden response list the set of privileges that would be
> sufficient to access the resource. I found this surprising, but I
> guess it comes down to a user needing to figure out what kind of
> security exception to apply for in order to make progress with some
> data analysis. I think this is a topic on which we could make a
> fairly late-binding decision — or even have it as a configurable
> option.

Anyone who's ever had to deal with Amazon's AWS IAM configuration
certainly can appreciate this need. I'm +1 on the idea, assuming it's
not hard to implement...but...

The problem is that it can be a data leak. In Jan's initial gist, he
shows the _access field being populated by usernames only (Scenario 1).
The only possible exception here is to get your username added to the
_access field on that document.

If we do this via roles, then you could be leaking role name
definitions via this response. Not sure we care, but having a full
list of roles that could possibly provide that permission is certainly
a hole.

If we do this via _capability_, then you're looking at a set of
permissions such as reader, writer, deleter, and that specific
permission could be returned:

  {"needed":"writer", "obtained":"reader"}

That'd work, but it's different from what Jan has proposed to date, I
believe, especially in distinguishing between read, write, and delete.
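
To make the capability variant concrete, here is a rough Python sketch. It assumes, as an illustration only, that the three capabilities form a simple ordering (reader < writer < deleter); the thread has not agreed on that, and `CAPABILITIES` and `check_capability` are made-up names.

```python
# Hypothetical ordering of capabilities, weakest first. This ordering is an
# assumption made for the sketch, not something the proposal defines.
CAPABILITIES = ("reader", "writer", "deleter")


def check_capability(needed, obtained):
    """Return None when access is allowed, else a 403-style body.

    Raises ValueError if either name is not a known capability.
    """
    needed_rank = CAPABILITIES.index(needed)
    obtained_rank = CAPABILITIES.index(obtained)
    if obtained_rank >= needed_rank:
        return None
    # Only the capability names are returned, no usernames or roles.
    return {"needed": needed, "obtained": obtained}
```

The response carries capability names only, so it does not reveal which users or roles hold them.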

-Joan

Re: [DISCUSS] Per-doc access control

Posted by Adam Kocoloski <ko...@apache.org>.

I’m also in favor of dropping Scenario 3.

One topic we may have discussed in the past but I wanted to close out here: in the relational database world it’s not uncommon to use materialized views as an access control mechanism to selectively expose contents of a table to clients who cannot access the table directly. Does the current thinking on _access for views support that use case? Can we build a view using a set of roles inherited from the user who created the design doc, but then turn around and set the _access on the view itself to a less-restrictive set?

On the _revs_diff topic — I’m not all that concerned about users trying to guess revision IDs that exist on the server, and then reverse-engineer the contents of the existing revisions. Maybe I ought to be.

On a somewhat-related note, I have had conversations before with folks who are keen to adopt these sorts of fine-grained access control systems who said they actually prefer to have a 403 Forbidden response list the set of privileges that would be sufficient to access the resource. I found this surprising, but I guess it comes down to a user needing to figure out what kind of security exception to apply for in order to make progress with some data analysis. I think this is a topic on which we could make a fairly late-binding decision — or even have it as a configurable option.

I could definitely see the base Scenario 1 (single _access labels) landing ahead of the more-complex sharing models.

I haven’t had a chance to take a deep look at the code but the design seems good and thoughtful, and I definitely like the focus on the use cases.

Adam

> On Mar 14, 2019, at 11:21 AM, Jan Lehnardt <ja...@apache.org> wrote:
> 
> My replies now inline.
> 
>> On 14. Mar 2019, at 16:13, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> I received some notes privately from Gregor Martynus, which I’m reproducing here in email thread form. This email is all Gregor’s notes, my next email is my replies to them.
>> 
>>> On 10. Mar 2019, at 15:51, Jan Lehnardt <ja...@apache.org> wrote:
>>> 
>>> Hey all,
>>> 
>>> after mulling this over some more, I’d like to tackle the detailed API and behaviour for this, especially how _access works in conjunction with existing access control features.
>>> 
>>> My guiding principles so far are:
>>> 
>>> 1. Make the API intuitive, things should work like they look like they should work like.
>>> 2. The default should never be that a resource is accidentally left accessible to the public.
>>> 3. This should work as a natural extension to the existing security features*.
>>> 
>>> * I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.
>>> 
>>> 
>>> ## Database Creation and Default Behaviours
>>> 
>>> Creating a database with _access features is, as mentioned before, done via a flag to PUT /database?access=true
>>> 
>>> In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone but server or db admins. Not very useful.
>>> 
>>> To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.
>>> 
>>> In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
>>> 
>>> * * *
>>> 
>>> 
>>> ## Document Creation and Access Control
>>> 
>>> Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.
>>> 
>>> 1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.
>>> 
>>> 2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.
>> 
>> I prefer being explicit.
>> 
>> 
>>> 
>>> This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user can not create a new document without an _access field that is an array that has at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.
>>> 
>>> The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well.
>> 
>> We could also provide tooling for migrations?
> 
> I’d love tooling, but we’d have to make sure we can do it correctly for a big number of use-cases. For the acceptance of this change, I’d make “documenting a migration path for db-per-user setups” a MUST have, and any code that helps with that a nice to have.
> 
>> 
>> 
>>> We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.
>>> 
>>> For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.
>>> 
>>> For both creations and updates, a user MUST supply at least one role they belong to or their own username.
>>> 
>>> * * *
>>> 
>>> 
>>> ## _revs_diff
>>> 
>>> /db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>> 
>>> This would allow users to specify ids and rev(s) for docs they don’t have access to (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know how to handle if it encounters it (say a user got removed from the _access list in between the replicator opening _changes and requesting the doc).
>>> 
>>> The _revs_diff implementation would have to be altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
>>> 
>>> An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.
>>> 
>>> * * *
>>> 
>>> 
>>> ## Partitioned Databases
>>> 
>>> I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
>>> 
>>> * * *
>>> 
>>> 
>>> ## Scenarios that _access should help with.
>>> 
>>> Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.
>>> 
>>> 
>>> ### Scenario 1: db-per-user
>>> 
>>> In this scenario, before _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.
>>> 
>>> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.
>>> 
>>> This is the simplest scenario, as all we’d have to do is track the owner of a document and produce by-access-id/seq indexes based on that owner.
>>> 
>>> The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest to implement and explain.
>>> 
>>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.
>>> 
>>> 
>>> ### Scenario 2: db-per-user + Sharing
>> 
>> One scenario we should address is how un-sharing would work when documents are continuously replicated, e.g. to a client for offline usage. My understanding is that the person whose access to documents got revoked does not get a _changes update telling them that their access got removed, so it would be up to the application developer to implement some kind of "notification" meta documents. Unless you have a better idea?
> 
> Since we now have a purge API as well, we could treat an un-share as a purge for clients, and they can decide what to do with it.
> 
> Alternatively, we need to make breaking changes to _changes feed, maybe we can hide that behind an opt-in flag, like “/db/_changes?access=true”, and then we can send new rows like:
> 
> {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
> 
> 
>> 
>>> 
>>> The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:
>>> 
>>> #### 2.1: The Todo List
>>> 
>>> In this scenario, a user has a reasonable amount of ”personal data” that they want to selectively share with one or more other users.
>>> 
>>> #### 2.2: The Chat/Forum/Newsgroup
>>> 
>>> In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.
>>> 
>>> #### 2.3: The Corporate Hierarchy
>>> 
>>> In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.
>>> 
>>> 
>>> ### 3: Multiple Apps
>>> 
>>> The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?
>>> 
>>> I’ll be referring to these scenarios down the line.
>>> 
>>> * * *
>>> 
>>> 
>>> ## Design Docs
>>> 
>>> ### Admin
>>> 
>>> One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).
>>> 
>>> N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
>>> 
>>> 
>>> ### User
>>> 
>>> In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?
>>> 
>>> The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>> 
>>> The easiest to handle is a `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to if they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.
>>> 
>>> Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.
>>> 
>>> More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs,
>> 
>> I can’t think of a db-per-user scenario where each user DB would have a different validate_doc_update method? It would be the same method with access to the user context, the DBs security setting and the document, so it would act differently for different users, but using the same code.
> 
> They wouldn’t be different, but if we were to replicate 1000 db-per-user design docs into a single database, as per today’s semantics, we’d have to run 1000 VDUs on each doc update.
> 
>> 
>>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). But especially in scenario 3. this becomes problematic, but even without that specific scenario, this violates the no surprises best practice.
>>> 
>>> We could say:
>>> 
>>> a) we don’t support scenario 3.
>> 
>> +1, I think it would make our lives easier in general if we don’t recommend to share the same CouchDB for multiple apps. At least I don’t see a reason to do that at this point.
> 
> I think I like this best, too, but I’d like to hear from others as well.
> 
> 
> Best
> Jan
> —
>> 
>>> b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
>>> c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.
>>> 
>>> I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.
>>> 
>>> In practice this would mean, a document that has an _id that starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.
>>> 
>>> 
>>> ## Read and Write Access
>>> 
>>> Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.
>>> 
>>> Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).
>>> 
>>> However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think it would be a reasonable expectation for users to expect a per-doc read/write permission granting.
>> 
>> Yes!
>> 
>>> 
>>> So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []}
>> 
>> I prefer this API for its compactness, thinking about offline synchronization. The smaller the docs, the better.
>> 
>> Best
>> “Gregor”
>> —
>> 
>> 
>>> or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.
>>> 
>>> The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
>>> 
>>> 
>>> * * * 
>>> 
>>> 
>>> So far, I think all of the problems outlined are solvable, if paired with a clear definition of which use-cases we do not support with _access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.
>>> 
>>> Thanks for reading this far and I’m looking forward to your feedback.
>>> 
>>> 
>>> Best,
>>> Jan “_access” Lehnardt
>>> —
>>> 
>>> 
>>> 
>>> 
>>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
>>>> 
>>>> Hi Everyone,
>>>> 
>>>> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
>>>> 
>>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>>> 
>>>> You can check out my branch here:
>>>> 
>>>> https://github.com/apache/couchdb/compare/access?expand=1
>>>> 
>>>> It is very much work in progress, but it is far enough along to warrant discussion.
>>>> 
>>>> The main point of this branch is to show all the places that we would need to change to support the proposal.
>>>> 
>>>> Things I’ve left for later:
>>>> 
>>>> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
>>>> - building per-access views has not been implemented yet, couch_index would have to be taught about the new per-access-id index.
>>>> - pretty HTTP error handling
>>>> - tests except for a tiny shell script 😇
>>>> 
>>>> Implementation notes:
>>>> 
>>>> You create a database with the _access feature turned on like so:  PUT /db?access=true
>>>> 
>>>> I started out with storing _access in the document body, as that would allow for a minimal change set. However, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive, so I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impacts multi-version cluster interop, and especially on whether this is an acceptable approach.
>>>> 
>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>> 
>>>> * * *
>>>> 
>>>> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes by-access-id and by-access-seq which do what you’d expect, pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
>>>> 
>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>> 
>>>> * * *
>>>> 
>>>> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
>>>> 
>>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
>>>> 
>>>> * * *
>>>> 
>>>> Open questions:
>>>> 
>>>> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and which would require all apps to be changed, is that apps using an _access enabled database have to include an _access field in their docs (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
>>>> 
>>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not use the username from the first question above and use that as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones), instead all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
>>>> 
>>>> - building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes, so we should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
>>>> 
>>>> * * *
>>>> 
>>>> I’ll leave this here for now, I’m sure there are a few more things to consider.
>>>> 
>>>> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
>>>> 
>>>> Best
>>>> Jan
>>>> —
>>> 
>>> -- 
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
>> 
>> -- 
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
>> 
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/

Re: [DISCUSS] Per-doc access control

Posted by Jan Lehnardt <ja...@apache.org>.

My replies now inline.

> On 14. Mar 2019, at 16:13, Jan Lehnardt <ja...@apache.org> wrote:
> 
> I received some notes privately from Gregor Martynus, which I’m reproducing here in email thread form. This email is all Gregor’s notes, my next email is my replies to them.
> 
>> On 10. Mar 2019, at 15:51, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> Hey all,
>> 
>> after mulling this over some more, I’d like to tackle the detailed API and behaviour for this, especially how _access works in conjunction with existing access control features.
>> 
>> My guiding principles so far are:
>> 
>> 1. Make the API intuitive, things should work like they look like they should work like.
>> 2. The default should never be that a resource is accidentally left accessible to the public.
>> 3. This should work as a natural extension to the existing security features*.
>> 
>> * I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.
>> 
>> 
>> ## Database Creation and Default Behaviours
>> 
>> Creating a database with _access features is, as mentioned before, done via a flag to PUT /database?access=true
>> 
>> In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone but server or db admins. Not very useful.
>> 
>> To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.
>> 
>> In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
>> 
>> * * *
>> 
>> 
>> ## Document Creation and Access Control
>> 
>> Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.
>> 
>> 1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.
>> 
>> 2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.
> 
> I prefer being explicit.
> 
> 
>> 
>> This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user cannot create a new document without an _access field that is an array that has at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.
>> 
>> The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well.
> 
> We could also provide tooling for migrations?

I’d love tooling, but we’d have to make sure we can do it correctly for a big number of use-cases. For the acceptance of this change, I’d make “documenting a migration path for db-per-user setups” a MUST have, and any code that helps with that a nice to have.
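
To make the documentation half of that a bit more concrete, here is a rough Python sketch of what one migration step for a single db-per-user database could look like. It assumes the semantics proposed in this thread (an `_access` array naming the owner, design docs de-duplicated and handled separately by an admin). The helper name `add_owner_access` is made up for illustration; nothing like it exists in CouchDB.

```python
import copy


def add_owner_access(docs, owner):
    """Return copies of `docs` with `_access` set to [owner] where it is missing.

    Hypothetical helper: `_access` follows the proposal in this thread and
    is not a shipped CouchDB feature.
    """
    migrated = []
    for doc in docs:
        # Design docs are assumed to be de-duplicated and re-created once
        # per app by an admin, so they are skipped rather than copied.
        if doc.get("_id", "").startswith("_design/"):
            continue
        new_doc = copy.deepcopy(doc)
        # Keep an explicit, non-empty _access list if the doc already has one.
        if not new_doc.get("_access"):
            new_doc["_access"] = [owner]
        migrated.append(new_doc)
    return migrated


if __name__ == "__main__":
    user_db_docs = [
        {"_id": "todo-1", "title": "buy milk"},
        {"_id": "_design/app", "views": {}},
        {"_id": "todo-2", "title": "shared", "_access": ["user_a", "team_x"]},
    ]
    for doc in add_owner_access(user_db_docs, "user_a"):
        print(doc)
```

This only covers the per-document transformation; how the docs are read from the old databases and written into the new one is left open here.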

> 
> 
>> We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to, say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.
>> 
>> For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.
>> 
>> For both creations and updates, a user MUST supply at least one role they belong to or their own username.
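
Read together, the rules quoted above could be sketched roughly like this in Python. This is only an illustration of the proposed behaviour, not the prototype’s code: `may_write` and the other function names are invented here, the userCtx is modelled as a plain dict with `name` and `roles`, and the possible `_public` role and any special handling of deletions are not modelled.

```python
ADMIN_ROLE = "_admin"


def principals(user_ctx):
    """Names a userCtx can match against an _access list: its name plus its roles."""
    return {user_ctx["name"], *user_ctx.get("roles", [])}


def is_admin(user_ctx):
    return ADMIN_ROLE in user_ctx.get("roles", [])


def effective_access(doc):
    """Docs without _access are treated as admin-only (the special case above)."""
    return doc.get("_access") or [ADMIN_ROLE]


def may_write(user_ctx, new_doc, existing_doc=None):
    """Apply the proposed rules for creations (existing_doc is None) and updates."""
    if is_admin(user_ctx):
        return True
    who = principals(user_ctx)
    # Updating requires the stored doc to name the user or one of their roles.
    if existing_doc is not None and not who & set(effective_access(existing_doc)):
        return False
    # The new body must carry a non-empty _access array ...
    access = new_doc.get("_access")
    if not isinstance(access, list) or not access:
        return False
    # ... that names the writing user or one of their roles.
    return bool(who & set(access))
```

For example, with `user_a = {"name": "user_a", "roles": ["team_x"]}`, a call like `may_write(user_a, {"a": 1, "_access": ["user_a"]})` is allowed, while the same doc without `_access` is rejected.
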
>> 
>> * * *
>> 
>> 
>> ## _revs_diff
>> 
>> /db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>> 
>> This would allow users to specify ids and rev(s) for docs they don’t have access to (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know what to do with if it encounters it (say a user got removed from the _access list in between the replicator opening _changes and requesting the doc).
>> 
>> The _revs_diff implementation would have to be altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
>> 
>> An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.
>> 
>> * * *
>> 
>> 
>> ## Partitioned Databases
>> 
>> I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
>> 
>> * * *
>> 
>> 
>> ## Scenarios that _access should help with.
>> 
>> Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.
>> 
>> 
>> ### Scenario 1: db-per-user
>> 
>> In this scenario, before _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.
>> 
>> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.
>> 
>> This is the simplest scenario, as all we’d have to do is track the owner of a document and produce by-access-id/seq indexes based on that owner.
>> 
>> The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest to implement and explain.
>> 
>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.
>> 
>> 
>> ### Scenario 2: db-per-user + Sharing
> 
> One scenario we should address is how stopping to share would work when documents are continuously replicated, e.g. to a client for offline usage. My understanding is that the person whose access to documents got revoked does not get a _changes update telling them that their access got removed; it would be up to the application developer to implement some kind of "notification" meta documents. Unless you have a better idea?

Since we now have a purge API as well, we could treat an un-share as a purge for clients, and they can decide what to do with it.

Alternatively, we need to make breaking changes to the _changes feed; maybe we can hide that behind an opt-in flag, like “/db/_changes?access=true”, and then we can send new rows like:

{seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
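
As a sketch of what a client could do with such rows (purely illustrative Python: the `_revoked` field is only the idea floated above, the row shape follows the shorthand in the example rather than the real _changes format, and `apply_changes` and the `local_docs` map are made-up names):

```python
def apply_changes(local_docs, rows):
    """Apply changes rows to a local {id: rev} map, treating a revocation like a purge.

    Returns the ids that were dropped locally because access was revoked.
    """
    revoked_ids = []
    for row in rows:
        doc_id = row["id"]
        if row.get("_revoked"):
            # Access was withdrawn: forget the local copy, as a purge would.
            if local_docs.pop(doc_id, None) is not None:
                revoked_ids.append(doc_id)
            continue
        # Regular row: remember the latest revision seen for this doc.
        local_docs[doc_id] = row["rev"]
    return revoked_ids
```

The returned ids give the application a hook to decide what to do, e.g. tell the user that a shared doc is gone instead of silently dropping it.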


> 
>> 
>> The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:
>> 
>> #### 2.1: The Todo List
>> 
>> In this scenario, a user has a reasonable amount of “personal data” that they want to selectively share with one or more other users.
>> 
>> #### 2.2: The Chat/Forum/Newsgroup
>> 
>> In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.
>> 
>> #### 2.3: The Corporate Hierarchy
>> 
>> In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.
>> 
>> 
>> ### 3: Multiple Apps
>> 
>> The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?
>> 
>> I’ll be referring to these scenarios down the line.
>> 
>> * * *
>> 
>> 
>> ## Design Docs
>> 
>> ### Admin
>> 
>> One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).
>> 
>> N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
>> 
>> 
>> ### User
>> 
>> In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?
>> 
>> The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>> 
>> The easiest to handle is a `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to if they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.
>> 
>> Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.
>> 
>> More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs,
> 
> I can’t think of a db-per-user scenario where each user DB would have a different validate_doc_update method? It would be the same method with access to the user context, the DBs security setting and the document, so it would act differently for different users, but using the same code.

They wouldn’t be different, but if we were to replicate 1000 db-per-user design docs into a single database, as per today’s semantics, we’d have to run 1000 VDUs on each doc update.

> 
>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). This becomes problematic especially in scenario 3., but even without that specific scenario, it violates the no surprises best practice.
>> 
>> We could say:
>> 
>> a) we don’t support scenario 3.
> 
> +1, I think it would make our lives easier in general if we don’t recommend sharing the same CouchDB for multiple apps. At least I don’t see a reason to do that at this point.

I think I like this best, too, but I’d like to hear from others as well.


Best
Jan
—
> 
>> b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
>> c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.
>> 
>> I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.
>> 
>> In practice this would mean, a document that has an _id that starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.
>> 
>> 
>> ## Read and Write Access
>> 
>> Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.
>> 
>> Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).
>> 
>> However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think it would be a reasonable expectation for users to expect a per-doc read/write permission granting.
> 
> Yes!
> 
>> 
>> So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []}
> 
> I prefer this API for its compactness, thinking about offline synchronization. The smaller the docs, the better.
> 
> Best
> “Gregor”
> —
> 
> 
>> or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.
>> 
>> The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
>> 
>> 
>> * * * 
>> 
>> 
>> So far, I think all of the problems outlined are solvable, if we clearly define which use-cases we do not support with _access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.
>> 
>> Thanks for reading this far and I’m looking forward to your feedback.
>> 
>> 
>> Best,
>> Jan “_access” Lehnardt
>> —
>> 
>> 
>> 
>> 
>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
>>> 
>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>> 
>>> You can check out my branch here:
>>> 
>>> https://github.com/apache/couchdb/compare/access?expand=1
>>> 
>>> It is very much work in progress, but it is far enough along to warrant discussion.
>>> 
>>> The main point of this branch is to show all the places that we would need to change to support the proposal.
>>> 
>>> Things I’ve left for later:
>>> 
>>> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
>>> - building per-access views has not been implemented yet, couch_index would have to be taught about the new per-access-id index.
>>> - pretty HTTP error handling
>>> - tests except for a tiny shell script 😇
>>> 
>>> Implementation notes:
>>> 
>>> You create a database with the _access feature turned on like so:  PUT /db?access=true
>>> 
>>> I started out with storing _access in the document body, as that would allow for a minimal change set, however, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive, so I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impacts multi-version cluster interop. And especially whether this is an acceptable approach.
>>> 
>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>> 
>>> * * *
>>> 
>>> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes by-access-id and by-access-seq which do what you’d expect, pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
>>> 
>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>> 
>>> * * *
>>> 
>>> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
>>> 
>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
>>> 
>>> * * *
>>> 
>>> Open questions:
>>> 
>>> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and which would require all apps to be changed, is that apps using an _access enabled database have to include an _access field in their docs (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
>>> 
>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not use the username from the first question above and use that as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones), instead all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
>>> 
>>> - building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes; we should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
>>> 
>>> * * *
>>> 
>>> I’ll leave this here for now, I’m sure there are a few more things to consider.
>>> 
>>> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
>>> 
>>> Best
>>> Jan
>>> —

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Re: [DISCUSS] Per-doc access control

Posted by Jan Lehnardt <ja...@apache.org>.

I received some notes privately from Gregor Martynus, which I’m reproducing here in email thread form. This email is all Gregor’s notes, my next email is my replies to them.

> On 10. Mar 2019, at 15:51, Jan Lehnardt <ja...@apache.org> wrote:
> 
> Hey all,
> 
> after mulling this over some more, I’d like to tackle the detailed API and behaviour for this, especially how _access works in conjunction with existing access control features.
> 
> My guiding principles so far are:
> 
> 1. Make the API intuitive: things should work the way they look like they should work.
> 2. The default should never be that a resource is accidentally left accessible to the public.
> 3. This should work as a natural extension to the existing security features*.
> 
> * I’d be up for reworking the whole lot, too, but that might be a better discussion for > 4.0.
> 
> 
> ## Database Creation and Default Behaviours
> 
> Creating a database with _access features is, as mentioned before, done via a flag: PUT /database?access=true
> 
> In a 3.0 world where this would land, we already agreed that databases should be admin-only by default (instead of world read/writeable as they are today). This is a sensible default, but that leaves us with an _access enabled database that can’t be used by anyone but server or db admins. Not very useful.
> 
> To allow arbitrary users to use the db, I suggest we use the existing _security system: i.e. if a user or a group a user belongs to is mentioned in either `admins` or `members` inside of _security, they can proceed and create documents on the db. This puts a second step burden on the application developer, but it slots cleanly into the existing security mechanisms, and doesn’t require special case handling. Alternatively, we could define that _security isn’t available in _access enabled databases, but that’s something I’d like to avoid if at all possible.
> 
> In order to make it easy to specify that “everyone in _users” should be able to use the db, I suggest we add a new role `_users` that is valid inside _security, which means “everyone in /_users” (this only excludes server admins which have full access anyway).
> 
> * * *
> 
> 
> ## Document Creation and Access Control
> 
> Next, one of our non-admin users creates a doc. There are multiple options as to how we store the _access information.
> 
> 1. Automatically translate the userCtx.name of a doc creation (not an update) into the first element of the _access array. E.g. user_a PUT /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit counter-intuitive.
> 
> 2. We require that a user puts "_access":["user_a"] in themselves. This is an explicit granting of access permissions on doc creation and I think is preferable.

I prefer being explicit.


> 
> This leaves the edge case of docs that have no _access member: so far I thought those docs are admin-only, with maybe a db-wide option to swap the default to public access, but I think given the explicitness of 2. we can do better: require _access for all new doc creations in access-enabled databases. A user cannot create a new document without an _access field that is an array that has at least one member. For public documents, we could invent a new role _public, and admin-only docs could use the existing role _admin.
> 
> The one downside to this approach is that we won’t be able to replicate existing databases into an access-enabled database without modifying all documents. This might be a worthwhile trade-off, but we should make that decision consciously and document it well.

We could also provide tooling for migrations?


> We could allow for a special case where an _admin user can create docs that have no _access field, and those docs are treated as having only the _admin role in _access. So at least we could replicate all data in, but then require a manual step to update all docs to, say, migrate an existing db-per-user app, while not accidentally exposing any docs to folks that shouldn’t read them.
> 
> For the rest of cRUD, the existing document must store one of the RUD-ing user’s name or role in its _access field.
> 
> For both creations and updates, a user MUST supply at least one role they belong to or their own username.
> 
> * * *
> 
> 
> ## _revs_diff
> 
> /db/_revs_diff can answer the question of which revisions of a document do NOT exist on a replication target: http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
> 
> This would allow users to specify ids and rev(s) for docs they don’t have access to (anymore), so the result schema should be expanded to handle id: unauthorized or somesuch, something the replicator needs to know what to do with if it encounters it (say a user got removed from the _access list in between the replicator opening _changes and requesting the doc).
> 
> The _revs_diff implementation would have to be altered to send an unauthorized token for each doc the requesting userCtx has no access to. If we can re-use some of our existing indexes, or any other performance optimisation, that’d be great. I haven’t looked at that code at all, yet.
> 
> An important side-effect of this is, once a user has been added to a doc’s _access list, they get access to “the full history of the doc”, even before they had access. Of course, in CouchDB this means only getting access to the rev ids, and not the content, but since they are content-addressable hashes, a user could brute-force themselves into revealing certain real values from earlier incarnations of the doc. I’d rather not track _access per document revision in perpetuity, so this is something we have to be very up-front about.
> 
> * * *
> 
> 
> ## Partitioned Databases
> 
> I mentioned partitioned databases in my previous mail, and I think it is something we can document that end-users can opt into, but doesn’t require any special casing on the _access proposal. That is, if users start prefixing their doc ids with a user name or id and enable both _access and partitions, then they get all the benefits of a partitioned database, and if they choose not to, they don’t, but things keep working. There are enough use-cases to warrant both behaviours.
> 
> * * *
> 
> 
> ## Scenarios that _access should help with.
> 
> Overall, we developed _access to allow users to stop using the db-per-user architecture, but once we have per-doc-access control, folks might start using this for all manner of things. We should be clear about which scenarios we support and which we don’t.
> 
> 
> ### Scenario 1: db-per-user
> 
> In this scenario, before _access enabled databases, the only way to allow mutually untrusting users to store data in a part of CouchDB that only they (and admins) have access to was giving each user their own database.
> 
> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user (aside from admins) can access those docs.
> 
> This is the simplest scenario, as all we’d have to do is track the owner of a document and produce by-access-id/seq indexes based on that owner.
> 
> The current prototype implementation mostly reflects this stage. Not saying this is what we should ship, but it is the easiest to implement and explain.
> 
> Aside, I might be able to be persuaded to ship this as a 2.x feature, to help those folks who don’t need anything else.
> 
> 
> ### Scenario 2: db-per-user + Sharing

One scenario we should address is how stopping to share would work when documents are continuously replicated, e.g. to a client for offline usage. My understanding is that the person whose access to documents got revoked does not get a _changes update telling them that their access got removed; it would be up to the application developer to implement some kind of "notification" meta documents. Unless you have a better idea?

> 
> The second we allow per doc auth, users will want to share those docs with other users. That’s why we initially suggested the _access field be an array, so other users and groups can be specified to have access. There are multiple scenarios in this one alone:
> 
> #### 2.1: The Todo List
> 
> In this scenario, a user has a reasonable amount of “personal data” that they want to selectively share with one or more other users.
> 
> #### 2.2: The Chat/Forum/Newsgroup
> 
> In this scenario, a user wants to share any number of documents with a reasonable number of groups. However, since we need to limit the number of groups a user belongs to (currently 10, see below for details), this might actually not be a great solution. Or folks couldn’t be in more than 10 chat groups at a time.
> 
> #### 2.3: The Corporate Hierarchy
> 
> In this scenario, users want to share any number of docs with a reasonable number of groups in a top-down/bottom-up fashion. Think CEO shares with executives, execs share with divisions, divisions report up to their one executive, etc.
> 
> 
> ### 3: Multiple Apps
> 
> The preceding scenarios all assume that a single application is responsible for everything. However, once we allow mutually distrusting users into a single database *and* make each per-user slice work (almost) like a full standalone CouchDB database, what would stop users from using this for a multi-homing feature, where different applications are used for each user in the same database?
> 
> I’ll be referring to these scenarios down the line.
> 
> * * *
> 
> 
> ## Design Docs
> 
> ### Admin
> 
> One of the downsides of db-per-user is managing design docs in the face of a changing application, that is, how to distribute new design docs across 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios above but scenario 3., we could simplify this significantly. Say an admin creates a design doc, and gives all users in the db access to this design doc (this could be with the _users role, or yet another new role _members, if we need it), requesting the result of a view defined in that design doc will produce an index that is powered by the requesting user’s by-access-seq index section(s).
> 
> N.B., this would require us to change a fundamental assumption when doing the association between a design doc’s definition and index: normally, there is only the `views` member that is hashed and that hash is used as the index’s filename. Because there is only by-seq to power a view, that all works. But now that we have an arbitrary set of sections on by-access-seq, any view index built will have to take a user’s name and roles into account. When a user leaves a group, or gains a group, all indexes for that user will no longer be valid and need rebuilding.
> 
> 
> ### User
> 
> In any of the scenarios above, but especially 3., there could be legitimate per-user design docs, so how should those be treated in an _access enabled database?
> 
> The significant fields in a design doc are `views`, `validate_doc_update` and `filters` (I’ll skip over the deprecated _show, _list, and _update).
> 
> The easiest to handle is a `filters`: if a user specifies a filter for a _changes request or replication that lives in a design doc they don’t have access to, they get an error, similar to if they specify a non-existent design doc, just with `unauthorized` instead of `not_found`.
> 
> Next `views` is also not very hard to imagine working: just like globally defined views for that db, the index is built for each user based on the user’s name and roles.
> 
> More troubling are `validate_doc_update` functions: One, they are already troubling in that they slow down any document updates. Two, if we now import an existing db-per-user scenario where each user has their own design docs,

I can’t think of a db-per-user scenario where each user DB would have a different validate_doc_update method? It would be the same method with access to the user context, the DBs security setting and the document, so it would act differently for different users, but using the same code.

> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are impractical to apply on each doc update, let alone just the management of VDUs that are active on a database. One option would be to ignore VDUs if they are not defined globally (say with a _members role). This becomes problematic especially in scenario 3., but even without that specific scenario, it violates the no surprises best practice.
> 
> We could say:
> 
> a) we don’t support scenario 3.

+1, I think it would make our lives easier in general if we don’t recommend sharing the same CouchDB for multiple apps. At least I don’t see a reason to do that at this point.

> b) we find a complicated but efficient way to apply only those VDUs that are defined in design docs the writing user has access to plus any global ones (this would be neat but rather complicated and potentially still impractical from a performance perspective for N users).
> c) we could store all per-user design docs, but ignore them completely, VDUs, views and filters.
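
For reference, option b) would boil down to something like the following Python sketch. All names are hypothetical, and I assume a design doc counts as global when `_members` appears in its `_access` list.

```python
GLOBAL_ROLE = "_members"  # assumption: marks a globally defined design doc


def applicable_vdus(design_docs, user_ctx):
    """Return the _ids of design docs whose VDU applies to this writer."""
    # Every writer implicitly matches the global role.
    principals = {user_ctx["name"], *user_ctx["roles"], GLOBAL_ROLE}
    selected = []
    for ddoc in design_docs:
        if "validate_doc_update" not in ddoc:
            continue
        if principals & set(ddoc.get("_access", [])):
            selected.append(ddoc["_id"])
    return selected


ddocs = [
    {"_id": "_design/global", "_access": ["_members"], "validate_doc_update": "..."},
    {"_id": "_design/alice", "_access": ["alice"], "validate_doc_update": "..."},
    {"_id": "_design/bob", "_access": ["bob"], "validate_doc_update": "..."},
    {"_id": "_design/views_only", "_access": ["alice"], "views": {}},
]
alice_vdus = applicable_vdus(ddocs, {"name": "alice", "roles": []})
```

This still scans every design doc on each write, so a real implementation would need an index from user or role to design docs. That is the complicated part.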
> 
> I think I currently fall on the side of not supporting scenario 3. and asking folks who migrate db-per-user to de-duplicate design docs and keep them per-app. I believe that is a good trade-off between the most common scenarios for db-per-user while keeping the implementation manageable. Globally accessible design docs would show up in a user’s changes feed and would replicate down to say a PouchDB application which might be the exclusive user of those design docs.
> 
> In practice this would mean that a document whose _id starts with _design/ will have to be produced by a database admin. Luckily, that’s already the case. We should just make sure that folks don’t give db-admin access to all users habitually.
> 
> 
> ## Read and Write Access
> 
> Speaking of validate_doc_update, it is used for two things: checking document schema and doc update authorisation.
> 
> Once we allow access to a document with an _access field, we need to decide what kind of access this gives to a doc: read-only or read-write (I’m not considering write-only because for anything but doc creations this is not useful as you need access to the current _rev).
> 
> However, when we look at implementing an application on top of our existing API, it is already weird that read access can be controlled globally (or with _access on a per doc level), but write access requires writing JavaScript code. I think users could reasonably expect to grant read/write permissions per doc.

Yes!

> 
> So we could have all of the above, but with two extra fields: _access_read and _access_write, or _access: {read: [], write: []}

I prefer this API for its compactness, thinking about offline synchronization. The smaller the docs, the better.
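
A small Python sketch of how a check against the `_access: {read: [], write: []}` shape might look. `can_access` is a made-up name, and "write implies read" is my assumption, not something we have decided.

```python
def can_access(doc, user_ctx, mode):
    """Check read or write permission against _access: {read: [], write: []}."""
    if "_admin" in user_ctx["roles"]:
        return True
    access = doc.get("_access", {})
    principals = {user_ctx["name"], *user_ctx["roles"]}
    writers = set(access.get("write", []))
    # Assumption: anyone who may write may also read.
    readers = set(access.get("read", [])) | writers
    allowed = writers if mode == "write" else readers
    return bool(allowed & principals)


doc = {"_id": "todo-1", "_access": {"read": ["team_a"], "write": ["alice"]}}
alice = {"name": "alice", "roles": []}
bob = {"name": "bob", "roles": ["team_a"]}
alice_can_write = can_access(doc, alice, "write")
bob_can_read = can_access(doc, bob, "read")
bob_can_write = can_access(doc, bob, "write")
```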

Best
“Gregor”
—


> or we overload user and group names: _access: [user_a:read, user_b:write] (or any permutation thereof). Overloading can cause trouble with naturally occurring characters in group names.
> 
> The former seems more explicit, but from an API perspective that’s a little more awkward: remember that we currently have an arbitrary limit of 10 members in a user’s role array, to avoid excessive fan out on cluster-internal operations. Partitioned dbs could get away with more, more easily however. If we allow the specification of access control in two lists, and one of the lists implies membership in the other, we have a total limit of 10 members across both arrays. Or we limit 5 + 5, but that seems excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
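
The "10 total across both arrays" variant could be enforced roughly like this. It is a Python sketch with hypothetical names, and counting a name that appears in both lists only once is my assumption.

```python
MAX_ACCESS_ENTRIES = 10  # the limit of 10 discussed above


def check_access_limit(access):
    """Return the number of distinct entries, or raise if over the limit."""
    # Assumption: a name listed in both arrays counts once.
    entries = set(access.get("read", [])) | set(access.get("write", []))
    if len(entries) > MAX_ACCESS_ENTRIES:
        raise ValueError(
            f"_access has {len(entries)} entries, limit is {MAX_ACCESS_ENTRIES}"
        )
    return len(entries)


small = check_access_limit({"read": ["alice", "team_a"], "write": ["alice"]})
```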
> 
> 
> * * * 
> 
> 
> So far, I think all of the problems outlined are solvable, provided we clearly define which use-cases we do not support with _access. If you have more scenarios than the ones I outlined, please add them and we can see if they cause any additional trouble.
> 
> Thanks for reading this far and I’m looking forward to your feedback.
> 
> 
> Best,
> Jan “_access” Lehnardt
> —
> 
> 
> 
> 
>> On 17. Feb 2019, at 15:25, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> Hi Everyone,
>> 
>> I’m happy to share my work in progress attempt to implement the per-doc access control feature we discussed a good while ago:
>> 
>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>> 
>> You can check out my branch here:
>> 
>> https://github.com/apache/couchdb/compare/access?expand=1
>> 
>> It is very much work in progress, but it is far enough along to warrant discussion.
>> 
>> The main point of this branch is to show all the places that we would need to change to support the proposal.
>> 
>> Things I’ve left for later:
>> 
>> - currently only the first element in the _access array is used. Our and/or syntax can be added later.
>> - building per-access views has not been implemented yet; couch_index would have to be taught about the new per-access-id index.
>> - pretty HTTP error handling
>> - tests except for a tiny shell script 😇
>> 
>> Implementation notes:
>> 
>> You create a database with the _access feature turned on like so:  PUT /db?access=true
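
For anyone who wants to script this, a minimal Python sketch that builds the request. The host, port and helper name are assumptions. Sending it would additionally need admin credentials and a server running the branch.

```python
from urllib.request import Request


def create_access_db_request(base_url, db_name):
    """Build (but do not send) the PUT request that enables _access."""
    return Request(f"{base_url}/{db_name}?access=true", method="PUT")


# Assumption: a local node on the default port.
req = create_access_db_request("http://127.0.0.1:5984", "db")
# To send it: urllib.request.urlopen(req), with authentication added.
```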
>> 
>> I started out with storing _access in the document body, as that would allow for a minimal change set. However, on doc updates, we try hard not to load the old doc body from the database, and forcing us to do so for EVERY doc update under _access seemed prohibitive. So I extended the #doc, #doc_info and #full_doc_info records with a new `access` attribute that is stored in both by-id and by-seq. I will need guidance on how extending these records impacts multi-version cluster interop, and especially on whether this is an acceptable approach.
>> 
>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>> 
>> * * *
>> 
>> The main addition is a new native query server called couch_access_native_proc, which implements two new indexes, by-access-id and by-access-seq. They do what you’d expect: pass in a userCtx and retrieve the equivalent of _all_docs or _changes, but only including those docs that match the username and roles in their _access property. The existing handlers for _all_docs and _changes have been augmented to use the new indexes instead of the default ones, unless the user is an admin.
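
To spell out the expected result of a by-access-seq lookup, here is a Python sketch of the semantics only. The branch uses dedicated indexes rather than a linear scan. The names are made up, and matching any element of `_access` is an assumption (the branch currently only looks at the first element).

```python
def visible_changes(by_seq, user_ctx):
    """Return (seq, doc_id) pairs the user may see, in update order."""
    is_admin = "_admin" in user_ctx["roles"]
    principals = {user_ctx["name"], *user_ctx["roles"]}
    return [
        (seq, doc["_id"])
        for seq, doc in by_seq
        # Docs without _access never match, so they stay admin-only.
        if is_admin or principals & set(doc.get("_access", []))
    ]


by_seq = [
    (1, {"_id": "a", "_access": ["alice"]}),
    (2, {"_id": "b", "_access": ["team_a"]}),
    (3, {"_id": "c"}),
    (4, {"_id": "d", "_access": ["bob"]}),
]
bob_changes = visible_changes(by_seq, {"name": "bob", "roles": ["team_a"]})
admin_changes = visible_changes(by_seq, {"name": "root", "roles": ["_admin"]})
```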
>> 
>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>> 
>> * * *
>> 
>> The rest of the diff is concerned with making document CRUD behave as you’d expect it. See this little demonstration for what things look like:
>> 
>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just noticing that there might be something wonky with DELETE, but you’ll get the gist #rimshot)
>> 
>> * * *
>> 
>> Open questions:
>> 
>> - The aim of this is to get as close to regular CouchDB behaviour as possible. One thing that is new, however, and that would require all apps to change: in an _access enabled database, apps have to include an _access field in their docs (docs with no _access are admin-only for now). We might want to consider auto-inserting the authenticated user’s name as the first element in the _access array on new document writes, so existing apps “just work”.
>> 
>> - Interplay with partitioned dbs: eschewing db-per-user is already a large boon if you have a lot of users, but making those per-user requests inside an _access enabled database efficient would be doubly nice, so why not take the username from the first question above and use it as the partition key? This would work nicely for natural users with their own docs that want to share them with others later, but I can easily imagine a pipelined use of CouchDB, where a “collector” user creates all new docs, an “analyser” takes them over and hands them to a “result” user for viewing. In that case, we’d violate the high-cardinality rule of partitions (have a lot of small ones); instead, all docs go through all three users. I’d be okay with treating the latter scenario as a minor use-case, but for that use-case, we should be able to disable auto-partitioning on db creation.
>> 
>> - Building access view indexes for docs that have frequent _access changes leads to many orphaned view indexes. We should look at an auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap back and forth).
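
On the first open question, auto-inserting the user name on new document writes could look roughly like this. It is a Python sketch: `with_default_access` is a hypothetical helper, and leaving an existing `_access` untouched is my assumption.

```python
def with_default_access(new_doc, user_ctx):
    """Return a copy of new_doc that is guaranteed to carry _access."""
    if new_doc.get("_access"):
        # Assumption: an explicit _access from the app wins.
        return dict(new_doc)
    # Otherwise the authenticated user becomes the first element.
    return {**new_doc, "_access": [user_ctx["name"]]}


alice = {"name": "alice", "roles": []}
plain = with_default_access({"_id": "note-1", "text": "hi"}, alice)
explicit = with_default_access({"_id": "note-2", "_access": ["team_a"]}, alice)
```

With this, an existing app that never sets `_access` ends up with docs owned by the writing user instead of admin-only docs.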
>> 
>> * * *
>> 
>> I’ll leave this here for now, I’m sure there are a few more things to consider.
>> 
>> I’d love to hear any and all feedback you might have. Especially if anything is unclear.
>> 
>> Best
>> Jan
>> —
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
> 