Posted to dev@couchdb.apache.org by Filipe David Manana <fd...@gmail.com> on 2010/02/27 19:32:05 UTC

/_all_dbs and security

Dear devs,

Currently, the URI handler for /_all_dbs just lists, recursively, all the db
files in the database dir (parameter database_dir of the .ini file).

Since we now have a _security object per DB (I dunno why it's not a regular
doc) which allows restricting access to each DB, that code is no longer
appropriate. It makes sense for this handler to return only the list of DBs a
user has access to.

It's through this URI that for example Futon lists the available DBs.

There's a ticket for this: https://issues.apache.org/jira/browse/COUCHDB-661

That solution is acceptable if the number of DBs on the server is "just" up
to about 10 000 or so. I tested with 7500 DBs, each occupying about 1 MB and
having 100 docs, and the response time for _all_dbs was about 4 seconds
(more details in the comments of that ticket).

The problem is that for each DB file found, one has to read its header and
then read its _security object to figure out if the session user can access
that DB. Therefore, we have 2 disk read operations for each DB file. 1
million DBs would imply 2 million disk reads.
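
To make the cost concrete, here is a minimal sketch of that per-file filter. It is not CouchDB's actual code: readHeader, readSecurity, the storage object and the shape of the _security object (readers.names / readers.roles) are assumptions for illustration, and the access rule (empty lists mean public, _admin sees everything) is a simplification.

```javascript
// Hypothetical access check; the _security shape is an assumption.
function canRead(security, userCtx) {
  const readers = security.readers || {};
  const names = readers.names || [];
  const roles = readers.roles || [];
  // Assumption: empty reader lists mean the DB is readable by anyone.
  if (names.length === 0 && roles.length === 0) return true;
  if (userCtx.roles.includes("_admin")) return true;
  if (names.includes(userCtx.name)) return true;
  return roles.some((role) => userCtx.roles.includes(role));
}

// Filters the DB file list for one user and counts the simulated disk reads.
function filterAllDbs(dbFiles, userCtx, storage) {
  let diskReads = 0;
  const visible = [];
  for (const file of dbFiles) {
    const header = storage.readHeader(file); // disk read 1
    diskReads += 1;
    const security = storage.readSecurity(header); // disk read 2
    diskReads += 1;
    if (canRead(security, userCtx)) visible.push(file);
  }
  return { visible, diskReads };
}

// In-memory stand-in for the storage layer (hypothetical).
const fakeStorage = {
  data: {
    "public.couch": { readers: { names: [], roles: [] } },
    "alice.couch": { readers: { names: ["alice"], roles: [] } },
    "staff.couch": { readers: { names: [], roles: ["staff"] } },
  },
  readHeader(file) {
    return { file };
  },
  readSecurity(header) {
    return this.data[header.file];
  },
};

const filterResult = filterAllDbs(
  Object.keys(fakeStorage.data),
  { name: "bob", roles: ["staff"] },
  fakeStorage
);
console.log(filterResult.visible, filterResult.diskReads);
```

With three DB files the sketch performs six reads, which is the 2-reads-per-file growth described above.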

Obviously an efficient solution for this would be to have a view which maps
users to DBs. I have an incomplete idea for this.
What I thought about is the following:

1) Having a special db, named "_dbs" (for example) which would contain meta
information about every available DB (like the meta tables in Oracle, SQL
Server, and so on).

2) That DB would contain a doc for each available DB. Each doc would contain
the reader names and roles associated to the corresponding DB (this is the
only kind of info we need for _all_dbs)

3) We would have a view, like Brian Candler suggested in a comment to that
ticket, that emits keys like:
    emit(['name',name],db)
    emit(['role',role],db)

4) For DBs with a _security object having empty lists for both the reader
names and reader roles, we would emit the special role "_public" for example

5) Whenever the _security object of a DB is updated, we would update the
corresponding reader names and roles in the _dbs DB.
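
As a rough sketch of points 2) to 4), the map function for such a view could look like the following. The doc layout in _dbs (_id holding the DB name, plus a readers object) is an assumption, and emit() is replaced by a small stand-in so the snippet runs outside CouchDB. The lookup helper only imitates what would presumably be a multi-key query against the view.

```javascript
// Stand-in for CouchDB's emit(): collects the rows the view would contain.
const viewRows = [];
function emit(key, value) {
  viewRows.push({ key, value });
}

// Map function for the hypothetical _dbs database.
// Assumed doc shape: { _id: "<db name>", readers: { names: [...], roles: [...] } }
function mapReaders(doc) {
  const readers = doc.readers || {};
  const names = readers.names || [];
  const roles = readers.roles || [];
  if (names.length === 0 && roles.length === 0) {
    // Point 4: no readers configured, so emit the special "_public" role.
    emit(["role", "_public"], doc._id);
    return;
  }
  names.forEach((name) => emit(["name", name], doc._id));
  roles.forEach((role) => emit(["role", role], doc._id));
}

const dbsDocs = [
  { _id: "public_wiki", readers: { names: [], roles: [] } },
  { _id: "alice_private", readers: { names: ["alice"], roles: [] } },
  { _id: "staff_notes", readers: { names: ["alice"], roles: ["staff"] } },
];
dbsDocs.forEach((doc) => mapReaders(doc));

// Imitates a lookup by the user's name, the user's roles and "_public".
function dbsVisibleTo(userCtx) {
  const wantedKeys = new Set(
    [["name", userCtx.name], ["role", "_public"], ...userCtx.roles.map((r) => ["role", r])]
      .map((key) => JSON.stringify(key))
  );
  const found = new Set();
  for (const row of viewRows) {
    if (wantedKeys.has(JSON.stringify(row.key))) found.add(row.value);
  }
  return [...found].sort();
}

console.log(dbsVisibleTo({ name: "bob", roles: ["staff"] }));
// -> [ 'public_wiki', 'staff_notes' ]
```

The point of the design is that answering _all_dbs for one user becomes a handful of key lookups in one index instead of a read of every DB file.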

I thought of some issues (for which I don't have a solution):

1) If a user just copies DB files from elsewhere (e.g. another server or a
backup) into the DBs directory, how do we detect them? Scanning for
all DB files at startup and taking proper action would be potentially slow.
Also, if a DB file is copied while CouchDB is running, I dunno how to detect
it. The only idea I have now is: every time a DB file is opened (due to a
user request), we check if _dbs has a corresponding entry and, if not, we take
proper action.

2) If a user deletes a DB file manually (e.g. rm db_file.couch), how do we
detect it and remove the corresponding entry in _dbs?

3) If a user restores a DB file backup containing an old _security object,
we need to detect that and update the entry in _dbs. A way to do this would
be to store the DB update seq number in the corresponding doc in _dbs and
then use the same idea as in 1).

These are very preliminary ideas.

I would like to collect suggestions from all of you on how to implement this
efficiently and know if you can point out any other problems I haven't
thought about.

thanks

best regards,

-- 
Filipe David Manana,
fdmanana@gmail.com
PGP key - http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xC569452B

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."

Re: /_all_dbs and security

Posted by Damien Katz <da...@apache.org>.
Yes, server admins.

-Damien

On Mar 1, 2010, at 7:50 AM, Filipe David Manana wrote:

> I assume you're talking about server admins (those listed in the .ini file),
> right? If you're talking about DB admins, then the problem is the same
> (reading the _security object from each db file).


Re: /_all_dbs and security

Posted by Mario Scheliga <ma...@sourcegarden.de>.
Why do you need this for DB admins? Is that an application-level
requirement?
Why not maintain that information in an extra database?

greetz
mario

On 01.03.2010 at 16:50, Filipe David Manana wrote:

> I assume you're talking about server admins (those listed in the .ini file),
> right? If you're talking about DB admins, then the problem is the same
> (reading the _security object from each db file).


--
Sourcegarden GmbH, commercial register (HR): B-104357
Tax number: 37/167/21214, VAT ID: DE814784953
Managing directors: Mario Scheliga, Rene Otto
Bank: Deutsche Bank, bank code (BLZ): 10070024, account (KTO): 0810929
Schoenhauser Allee 51, 10437 Berlin


Re: /_all_dbs and security

Posted by Filipe David Manana <fd...@gmail.com>.
I assume you're talking about server admins (those listed in the .ini file),
right? If you're talking about DB admins, then the problem is the same
(reading the _security object from each db file).

On Mon, Mar 1, 2010 at 4:45 PM, Damien Katz <da...@apache.org> wrote:

> I don't think end users need a list of dbs they can access on the server.
> Think the simplest answer is to only support _all_dbs operation for admins
> and be done.
>
> -Damien

Re: /_all_dbs and security

Posted by Damien Katz <da...@apache.org>.
I don't think end users need a list of dbs they can access on the server. Think the simplest answer is to only support _all_dbs operation for admins and be done.
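
A minimal sketch of that admin-only variant, under assumptions: the handler name, the listDbFiles callback and the 401 response are hypothetical, and the check simply looks for the "_admin" role in a { name, roles } user context.

```javascript
// Hypothetical _all_dbs handler that only serves server admins.
// userCtx is assumed to look like { name: "bob", roles: ["staff"] }.
function handleAllDbs(userCtx, listDbFiles) {
  const isServerAdmin =
    Array.isArray(userCtx.roles) && userCtx.roles.includes("_admin");
  if (!isServerAdmin) {
    // Assumption: reject with 401; the real status code could differ.
    return {
      status: 401,
      body: { error: "unauthorized", reason: "You are not a server admin." },
    };
  }
  // Only admins reach the directory listing, so no _security reads are needed.
  return { status: 200, body: listDbFiles() };
}

let listingCalls = 0;
const listDbFilesStub = () => {
  listingCalls += 1;
  return ["alice_private", "public_wiki"];
};

const deniedResponse = handleAllDbs({ name: "bob", roles: [] }, listDbFilesStub);
const adminResponse = handleAllDbs({ name: "root", roles: ["_admin"] }, listDbFilesStub);
console.log(deniedResponse.status, adminResponse.status, listingCalls);
```

The appeal of this option is that a non-admin request never touches any DB file at all.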

-Damien




Re: /_all_dbs and security

Posted by Brian Candler <B....@pobox.com>.
On Mon, Mar 01, 2010 at 09:55:46AM +0100, Filipe David Manana wrote:
> > The reason for not storing _security as a doc is an optimization. So we
> > extend that optimization, and have something like a  security_changed event
> > for a db, that the _dbs database can react to. The model isn't different
> > from subscribing to _changes, it'd just be a separate code path.
> 
> That's a good idea (both simple and more efficient).
> 
> The only issues left are the cases where the user adds a new DB file
> (possibly coming from other server for e.g.) into the DB dir, deletes a DB
> file or replaces a DB file with an old version (a backup whose update seq
> number is from the past).
...
> Do you think this would add too much overhead or it could be a somewhat
> "light" approach? Or better, do you have a better idea for it?

Just as an idea, you could just turn this on its head. Suppose _dbs was the
primary source of information; then the _security record within the database
is just a cached copy of that.  It's easy enough to take a _changes feed
from _dbs to update this cache.

But in that case, the cache would be better kept in RAM, rather than within
the database file.

After all, CouchDB already keeps a cache of open database filehandles,
doesn't it?  So you could read the security information from a regular
document (an "expensive" operation) when you open the database, and then
just continue to use that version thereafter.  You'd invalidate the cache if
the corresponding _dbs object is updated.
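
A small sketch of that caching idea, with everything named here being hypothetical: loadSecurity stands for the "expensive" document read done when a database is opened, and onDbsChange stands for a handler fed by the _changes feed of _dbs.

```javascript
// In-RAM cache of security info, filled on first open and
// invalidated when the matching _dbs entry changes.
class SecurityCache {
  constructor(loadSecurity) {
    this.loadSecurity = loadSecurity; // the "expensive" read (assumed callback)
    this.entries = new Map();
    this.loads = 0; // counts expensive reads, for illustration only
  }

  get(dbName) {
    if (!this.entries.has(dbName)) {
      this.loads += 1;
      this.entries.set(dbName, this.loadSecurity(dbName));
    }
    return this.entries.get(dbName);
  }

  // Would be called for each row arriving on the _dbs _changes feed.
  onDbsChange(change) {
    this.entries.delete(change.id);
  }
}

const securityDocs = {
  staff_notes: { readers: { names: [], roles: ["staff"] } },
};
const securityCache = new SecurityCache((dbName) => securityDocs[dbName]);

securityCache.get("staff_notes"); // expensive read
securityCache.get("staff_notes"); // served from RAM
securityDocs.staff_notes = { readers: { names: ["alice"], roles: [] } };
securityCache.onDbsChange({ id: "staff_notes" }); // invalidate
const refreshedSecurity = securityCache.get("staff_notes"); // expensive read again
console.log(securityCache.loads, refreshedSecurity.readers.names);
```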

However this leaves the following issue: what if the database file is
renamed on disk, or disk-copied to a different system which happens to have
an existing _dbs entry for that name?

I think the best solution is for the _dbs database to be indexed by uuid. 
But to make this work efficiently, the database file *on disk* should also
be named by its uuid, rather than the database name.  That's probably too
big a change to swallow at this stage.

But it does have some other side benefits (such as being able to "rename" a
database instantly without touching the filesystem, and being able to
automatically spread a large number of databases across directories without
forcing the user to use database names like 00/xxx, 01/yyy, 02/zzz etc)
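
To illustrate those side benefits, here is a hypothetical sketch of a name-to-uuid registry. The class, the two-character directory prefix and the ".couch" path layout are assumptions, not an existing CouchDB scheme.

```javascript
// Maps database names to uuids; files on disk are named by uuid only.
class DbRegistry {
  constructor() {
    this.uuidByName = new Map();
  }

  register(name, uuid) {
    if (this.uuidByName.has(name)) throw new Error(`name in use: ${name}`);
    this.uuidByName.set(name, uuid);
  }

  // Renaming only touches the registry, never the filesystem.
  rename(oldName, newName) {
    if (!this.uuidByName.has(oldName)) throw new Error(`unknown db: ${oldName}`);
    if (this.uuidByName.has(newName)) throw new Error(`name in use: ${newName}`);
    this.uuidByName.set(newName, this.uuidByName.get(oldName));
    this.uuidByName.delete(oldName);
  }

  // Spreads files across directories using the first two uuid characters.
  pathFor(name) {
    const uuid = this.uuidByName.get(name);
    if (uuid === undefined) throw new Error(`unknown db: ${name}`);
    return `${uuid.slice(0, 2)}/${uuid}.couch`;
  }
}

const registry = new DbRegistry();
registry.register("orders", "3f2a9c0e5b7d4e1fa6c8d0b1e2f3a4b5");
const pathBeforeRename = registry.pathFor("orders");
registry.rename("orders", "orders_2010");
const pathAfterRename = registry.pathFor("orders_2010");
console.log(pathBeforeRename, pathAfterRename);
```

Both paths are identical, which is the "instant rename" property: the file never moves, only the registry entry does.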

Regards,

Brian.

Re: /_all_dbs and security

Posted by Filipe David Manana <fd...@gmail.com>.
> The reason for not storing _security as a doc is an optimization. So we
> extend that optimization, and have something like a  security_changed event
> for a db, that the _dbs database can react to. The model isn't different
> from subscribing to _changes, it'd just be a separate code path.
>
>  Chris
>

That's a good idea (both simple and more efficient).

The only issues left are the cases where the user adds a new DB file
(possibly coming from another server, e.g.) into the DB dir, deletes a DB
file, or replaces a DB file with an old version (a backup whose update seq
number is from the past).

I thought of having a separate process that scans the FS contents from time
to time. But this would likely be too heavy for the case of millions of
DBs.

On the other hand, if absolute consistency is not a must, the detection of a
new DB could be done the first time a request for it is processed (like GET
/somedb, GET /somedb/somedoc, etc.). Opening a DB would trigger a
"db_opened" event, which would cause a process to verify that the DB is listed
in _dbs and that the DB's update seq number matches the one in _dbs (useful for
the case where an old db file is restored). A similar approach would apply when
opening a DB fails because the DB file doesn't exist.
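
Roughly, the check could be sketched like this. All names are hypothetical: dbsIndex stands in for the _dbs database, and db is assumed to expose name, updateSeq and security once opened.

```javascript
// Reconcile one _dbs entry when a database is opened.
// dbsIndex: Map of db name -> { updateSeq, readers } (stand-in for _dbs).
function onDbOpened(dbsIndex, db) {
  const entry = dbsIndex.get(db.name);
  const fresh = { updateSeq: db.updateSeq, readers: db.security.readers };
  if (entry === undefined) {
    dbsIndex.set(db.name, fresh); // file was copied in behind our back
    return "added";
  }
  if (entry.updateSeq !== db.updateSeq) {
    dbsIndex.set(db.name, fresh); // e.g. an old backup was restored
    return "refreshed";
  }
  return "unchanged";
}

// Remove the entry when opening fails because the file is gone.
function onDbOpenFailed(dbsIndex, dbName) {
  return dbsIndex.delete(dbName) ? "removed" : "not_listed";
}

const dbsIndex = new Map();
const openedDb = {
  name: "somedb",
  updateSeq: 42,
  security: { readers: { names: ["alice"], roles: [] } },
};

const reconcileSteps = [
  onDbOpened(dbsIndex, openedDb), // first open of a copied-in file
  onDbOpened(dbsIndex, openedDb), // nothing changed
  onDbOpened(dbsIndex, { ...openedDb, updateSeq: 17 }), // restored old backup
  onDbOpenFailed(dbsIndex, "somedb"), // file deleted manually
];
console.log(reconcileSteps);
// -> [ 'added', 'unchanged', 'refreshed', 'removed' ]
```

Note that in this sketch any difference in update seq triggers a refresh, so ordinary writes would cause one too unless _dbs is kept in step some other way; a real implementation would presumably need something narrower.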

Do you think this would add too much overhead, or could it be a somewhat
"light" approach? Or better, do you have a better idea for it?

cheers



Re: /_all_dbs and security

Posted by Filipe David Manana <fd...@gmail.com>.
> I'm not too happy about "half baked" solutions here where we go out
> of our way to avoid them in other places.
>
> If we attempt to make this work, we should get it right and not have
> some invisible performance drop-off.
>
> I think _dbs is the way to go, but we still need to vet it for design
> issues.
>

Because it's not a complete solution, or because no new features or design
changes can be added until 1.0? I was thinking of implementing it for some
release after 1.0.




Re: /_all_dbs and security

Posted by Jan Lehnardt <ja...@apache.org>.
On 28 Feb 2010, at 09:39, J Chris Anderson wrote:

> 
> On Feb 28, 2010, at 7:42 AM, Filipe David Manana wrote:
> 
>>> this is the best reason I've heard for making it a security document. I
>>> wonder how much slower the 7.5k dbs scan proceeds when it has to look up
>>> documents instead of linked objects? do you mind adding a doc-read to the
>>> tight loop just to see what it does to performance?
>>> 
>> 
>> $ time curl http://localhost:5984/_all_dbs | wc -l
>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>                                  Dload  Upload   Total   Spent    Left  Speed
>> 100  100k    0  100k    0     0    576      0 --:--:--  0:02:57 --:--:--   555
>> 7397
>> 
>> real    2m57.811s
>> user    0m0.000s
>> sys     0m0.020s
>> 
>> A lot more, as expected.
>> 
>> 
>>> 
>>> the 7.5k thing isn't important once we have a _dbs db, but the cost it will
>>> expose as a benchmark will be proportional to the cost incurred on opening
>>> any db for any operation, and thus significant.
>>> 
>> 
>> True. Traversing the b-tree of _dbs to find if a particular doc exists, and
>> then eventually insert it (which could imply re-balancing  the tree from
>> time to time) would take several disk accesses.
>> 
>> Any other ideas on how to implement _all_dbs efficiently?
>> 
> 
> The reason for not storing _security as a doc is an optimization. So we extend that optimization, and have something like a security_changed event for a db, that the _dbs database can react to. The model isn't different from subscribing to _changes, it'd just be a separate code path.
> 
> I think we can add your filter to _all_dbs, it seems "fast enough". People with millions of dbs will still need to do something special in front of couch, but they probably won't be changing their whole system to use the Reader ACLs anyway, as they already have a bunch of namespaced db-names.

I'm not too happy about "half baked" solutions here where we go out
of our way to avoid them in other places.

If we attempt to make this work, we should get it right and not have
some invisible performance drop-off.

I think _dbs is the way to go, but we still need to vet it for design
issues.

Cheers
Jan
--




Re: /_all_dbs and security

Posted by J Chris Anderson <jc...@gmail.com>.
On Feb 28, 2010, at 7:42 AM, Filipe David Manana wrote:

>> this is the best reason I've heard for making it a security document. I
>> wonder how much slower the 7.5k dbs scan proceeds when it has to look up
>> documents instead of linked objects? do you mind adding a doc-read to the
>> tight loop just to see what it does to performance?
>> 
> 
> $ time curl http://localhost:5984/_all_dbs | wc -l
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100  100k    0  100k    0     0    576      0 --:--:--  0:02:57 --:--:--   555
> 7397
> 
> real    2m57.811s
> user    0m0.000s
> sys     0m0.020s
> 
> A lot more, as expected.
> 
> 
>> 
>> the 7.5k thing isn't important once we have a _dbs db, but the cost it will
>> expose as a benchmark will be proportional to the cost incurred on opening
>> any db for any operation, and thus significant.
>> 
> 
> True. Traversing the b-tree of _dbs to find if a particular doc exists, and
> then eventually insert it (which could imply re-balancing  the tree from
> time to time) would take several disk accesses.
> 
> Any other ideas on how to implement _all_dbs efficiently?
> 

The reason for not storing _security as a doc is an optimization. So we extend that optimization, and have something like a security_changed event for a db, that the _dbs database can react to. The model isn't different from subscribing to _changes, it'd just be a separate code path.
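
A toy sketch of that event flow, under assumptions: DbEventBus, setSecurity and the readersByDb map (standing in for the _dbs database) are hypothetical names, and the actual write of _security into the db file is left as a comment.

```javascript
// Minimal synchronous event bus, standing in for whatever internal
// notification mechanism the server would use.
class DbEventBus {
  constructor() {
    this.listeners = new Map();
  }
  on(eventName, listener) {
    if (!this.listeners.has(eventName)) this.listeners.set(eventName, []);
    this.listeners.get(eventName).push(listener);
  }
  emit(eventName, payload) {
    for (const listener of this.listeners.get(eventName) || []) {
      listener(payload);
    }
  }
}

const bus = new DbEventBus();
const readersByDb = new Map(); // stand-in for the docs in _dbs

// The _dbs side reacts to security changes instead of scanning DB files.
bus.on("security_changed", ({ dbName, security }) => {
  readersByDb.set(dbName, security.readers);
});

function setSecurity(dbName, security) {
  // (the _security object would be written into the db file here)
  bus.emit("security_changed", { dbName, security });
}

setSecurity("staff_notes", { readers: { names: [], roles: ["staff"] } });
setSecurity("staff_notes", { readers: { names: ["alice"], roles: ["staff"] } });
console.log(readersByDb.get("staff_notes"));
```

The _security object stays where it is, and _dbs only has to keep up with the events.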

I think we can add your filter to _all_dbs, it seems "fast enough". People with millions of dbs will still need to do something special in front of couch, but they probably won't be changing their whole system to use the Reader ACLs anyway, as they already have a bunch of namespaced db-names.

Chris



Re: /_all_dbs and security

Posted by Filipe David Manana <fd...@gmail.com>.
> this is the best reason I've heard for making it a security document. I
> wonder how much slower the 7.5k dbs scan proceeds when it has to look up
> documents instead of linked objects? do you mind adding a doc-read to the
> tight loop just to see what it does to performance?
>

$ time curl http://localhost:5984/_all_dbs | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  100k    0  100k    0     0    576      0 --:--:--  0:02:57 --:--:--   555
7397

real    2m57.811s
user    0m0.000s
sys     0m0.020s

A lot more, as expected.


>
> the 7.5k thing isn't important once we have a _dbs db, but the cost it will
> expose as a benchmark will be proportional to the cost incurred on opening
> any db for any operation, and thus significant.
>

True. Traversing the b-tree of _dbs to find if a particular doc exists, and
then eventually insert it (which could imply re-balancing  the tree from
time to time) would take several disk accesses.

Any other ideas on how to implement _all_dbs efficiently?

cheers



Re: /_all_dbs and security

Posted by J Chris Anderson <jc...@gmail.com>.
On Feb 27, 2010, at 10:32 AM, Filipe David Manana wrote:

> Dear devs,
> 
> Currently, the URI handler for /_all_dbs just lists, recursively, all the db
> files in the database dir (parameter database_dir of the .ini file).
> 
> Since we now have a _security object per DB (I dunno why it's not a regular
> doc) which allows restricting access to each DB, that code is no longer
> appropriate. It makes sense for this handler to return only the list of DBs a
> user has access to.
> 
> It's through this URI that for example Futon lists the available DBs.
> 
> There's a ticket for this: https://issues.apache.org/jira/browse/COUCHDB-661
> 
> That solution is acceptable if the number of DBs in the server is "just" up
> to about 10 000 or so. I tested with 7500 DBs, each occupying about 1Mb and
> having 100 docs, and the response time for _all_dbs was about 4 seconds
> (more details in the comments of that ticket).
> 
> The problem is that for each DB file found, one has to read its header and
> then read its _security object to figure out if the session user can access
> that DB. Therefore, we have 2 disk read operations for each DB file. 1
> million DBs would imply 2 million disk reads.
> 
> Obviously an efficient solution for this would be to have a view which maps
> users to DBs. I have an incomplete idea for this.
> What I thought about is the following:
> 
> 1) Having a special db, named "_dbs" (for example) which would contain meta
> information about every available DB (like the meta tables in Oracle, SQL
> Server, and so on).
> 
> 2) That DB would contain a doc for each available DB. Each doc would contain
> the reader names and roles associated to the corresponding DB (this is the
> only kind of info we need for _all_dbs)
> 
> 3) We would have a view, like Brian Candler suggested in a comment to that
> ticket, that emits keys like:
>    emit(['name',name],db)
>    emit(['role',role],db)
> 
> 4) For DBs with a _security object having empty lists for both the reader
> names and reader roles, we would emit the special role "_public" for example
> 
> 5) Whenever the _security object of a DB is updated, we would update the
> corresponding reader names and roles in the _dbs DB.
> 

This is the best reason I've heard for making it a security document. I wonder how much slower the 7.5k dbs scan gets when it has to look up documents instead of linked objects? Do you mind adding a doc-read to the tight loop just to see what it does to performance?

The 7.5k thing isn't important once we have a _dbs db, but the cost it will expose as a benchmark will be proportional to the cost incurred on opening any db for any operation, and thus significant.


