Posted to dev@couchdb.apache.org by Damien Katz <da...@gmail.com> on 2008/04/28 18:27:40 UTC

CouchDB 1.0 work

Here are my thoughts on what we need before we can get to CouchDB
1.0. Feedback please.

Must have:

Incremental reduce: Maybe the single biggest outstanding work item.
Probably two weeks of development to get to a testable state.

Security/Document validation: We need a way to control who can update
which documents and to validate that updates are correct. This is
absolutely necessary for offline replication, where replicated updates
to the database do not come through the application layer.
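
For illustration, a rough sketch of what such a hook could look like;
the module, the validate_update/3 name, the #user_ctx record, and the
document-as-proplist shape are all invented here:

    -module(couch_validate_sketch).
    -export([validate_update/3]).

    -record(user_ctx, {name}).

    %% Hypothetical hook, called for every update whether it arrives
    %% via the HTTP API or via replication: return ok to accept the
    %% update, or {forbidden, Reason} to reject it.
    validate_update(NewDoc, _OldDoc, #user_ctx{name = User}) ->
        case proplists:get_value(<<"author">>, NewDoc) of
            User      -> ok;
            undefined -> {forbidden, <<"documents must name an author">>};
            _Other    -> {forbidden, <<"only the author may update">>}
        end.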

View index compaction/management: View indexes currently just grow;
they need compaction similar to storage compaction. Also, there is no
way to purge old unused indexes, except via the OS.

File sync problem: file:sync(), a call that flushes all uncommitted
writes to disk before returning, doesn't work fully, or at all, on
some platforms (usually we just lack the flags to tell the OS to write
to disk). This should be fixable by either patching the existing
Erlang driver source or using a replacement file driver.
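
To make the stakes concrete, the commit discipline that relies on
file:sync/1 looks roughly like this (file name and data are
placeholders):

    %% Append the document data, force it to disk, then append the
    %% header that references it, and force that too. If file:sync/1
    %% does not truly reach the disk, a crash can leave a header
    %% pointing at data that was never written.
    DocBytes = term_to_binary({doc, <<"example">>}),
    HeaderBytes = term_to_binary({header, byte_size(DocBytes)}),
    {ok, Fd} = file:open("sync_test.couch", [append, raw, binary]),
    ok = file:write(Fd, DocBytes),
    ok = file:sync(Fd),
    ok = file:write(Fd, HeaderBytes),
    ok = file:sync(Fd),
    ok = file:close(Fd).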

Optimizations: Right now HTTP overhead is huge, with HTTP latency/
overhead at about 80% of our document read time when loaded from a
local client (same machine). Once we can get this down to below 50%,
we'll focus on optimizing the database and other components. Most core
database operations (document reads, updates, and view indexing) are
completely unoptimized so far, with update speed being the biggest
complaint.
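
One quick way to measure the end-to-end read latency from a local
client (http:request/1 is the old inets API; the database and document
names below are placeholders):

    %% Average the end-to-end time of 100 local HTTP document reads;
    %% comparing against the same read timed inside the server shows
    %% the protocol share. Times are in microseconds.
    inets:start(),
    Url = "http://127.0.0.1:5984/db/docid",
    Time = fun() -> element(1, timer:tc(http, request, [Url])) end,
    Samples = [Time() || _ <- lists:seq(1, 100)],
    io:format("mean read time: ~p us~n",
              [lists:sum(Samples) div length(Samples)]).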

Testing: We need lots more tests. By the time we ship 1.0, we should
have far more test suite code than production code. And we need to do
load testing. Can the current browser-based test suite scale to this
kind of heavy testing?

Nice to have:

Plug-ins: An Erlang module plug-in architecture, to make adding new
server-side code easy. Right now the code that maps special URLs
(_view, _compact, _search, etc.) to the appropriate Erlang call is
messy and convoluted, and getting worse as we go. We need a standard
way to map the special URLs to the appropriate Erlang call.
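
One possible shape for that standard mapping, with all module and
function names invented for illustration:

    -module(couch_dispatch_sketch).
    -export([special_handler/1]).

    %% One central table from the special URL component to the
    %% handler; the real handler names would be whatever the modules
    %% actually export.
    special_handler(<<"_view">>)    -> {ok, {couch_view, handle_view_req}};
    special_handler(<<"_compact">>) -> {ok, {couch_db, handle_compact_req}};
    special_handler(<<"_search">>)  -> {ok, {couch_ft, handle_search_req}};
    special_handler(_Other)         -> not_found.

The HTTP layer then reduces to one case: apply M:F on {ok, {M, F}},
answer 404 on not_found, and adding a plug-in means adding one clause
(or one entry in a table read from config).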

Tail-committed database headers: To optimize database updates by
reducing the number and length of seeks required, the file header
should be written to the end of the file rather than the beginning.
Depending on the platform this can remove a full head seek, and in the
best case a document insert/update can require zero head seeks (if the
head is already positioned at the end of the file). But this can slow
file opening, as we may need to search the file for the most recent
valid header. In the event of a crash, the header scan/search cost at
database open can be linear or logarithmic, depending on the exact
implementation.
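
A sketch of the crash-recovery scan, assuming (purely for
illustration) that headers sit on fixed block boundaries laid out as
<<Md5:16/binary, Size:32, Header:Size/binary, Padding/binary>>:

    -module(couch_header_scan_sketch).
    -export([find_header/1]).

    -define(BLOCK_SIZE, 4096).

    %% Entry point: start at the last block boundary, walk backwards.
    find_header(Fd) ->
        {ok, Eof} = file:position(Fd, eof),
        find_header(Fd, (Eof div ?BLOCK_SIZE) * ?BLOCK_SIZE).

    %% A block whose MD5 verifies holds the most recent valid header.
    find_header(_Fd, Pos) when Pos < 0 ->
        no_valid_header;
    find_header(Fd, Pos) ->
        case file:pread(Fd, Pos, ?BLOCK_SIZE) of
            {ok, <<Md5:16/binary, Size:32, Rest/binary>>}
                    when Size =< byte_size(Rest) ->
                <<HeaderBin:Size/binary, _/binary>> = Rest,
                case erlang:md5(HeaderBin) of
                    Md5 -> {ok, binary_to_term(HeaderBin)};
                    _   -> find_header(Fd, Pos - ?BLOCK_SIZE)
                end;
            _ ->
                find_header(Fd, Pos - ?BLOCK_SIZE)
        end.

This is the linear variant; recording candidate commit points or
bisecting over them is what would make the scan logarithmic.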

Clustering: The ability to cluster CouchDB servers, to increase both
reliability (failover clustering) and client scalability (more servers
to handle more concurrent user load). Clustering does not increase
data scalability (that's partitioning/sharding).

Selective document purging/compaction: Deletion stubs are kept around
for replication purposes. We need a way to purge the records of
documents that are old or deleted.

Revision path pruning: Each document keeps a list of all previous
revisions. We need a way to prune the oldest records of document
revisions and re-merge pruned lists during replication.
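
The pruning itself can be tiny; the sketch below assumes rev ids are
kept newest-first, and leaves out the hard part (re-merging pruned
lists during replication):

    -module(couch_rev_prune_sketch).
    -export([prune_revs/2]).

    %% Keep only the newest Limit revision ids; lists:sublist/2
    %% returns the whole list unchanged when it is already short
    %% enough.
    prune_revs(RevIds, Limit) ->
        lists:sublist(RevIds, Limit).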

Don't Need:

Authentication: We can go to 1.0 without authentication, relying
instead on local proxies to provide it.

Partitioning: This is a big project with lots of considerations. It's
best to move it post-1.0.

Re: CouchDB 1.0 work

Posted by Christopher Lenz <cm...@gmx.de>.
On 28.04.2008, at 18:27, Damien Katz wrote:
> Here are my thoughts on what we need before we can get to
> CouchDB 1.0. Feedback please.
>
> Must have:

One thing that I think should be on this list is the intersection of  
full text search results and view results. As long as we don't have  
that (or something similar/better), the search engine integration is
of fairly limited use, in my opinion.
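
Conceptually the intersection is just this (row shape invented here),
though doing it efficiently inside the view engine is the real work:

    %% Keep only the view rows whose doc id also appears in the
    %% full-text result set.
    intersect(ViewRows, SearchIds) ->
        Set = sets:from_list(SearchIds),
        [Row || {Id, _Key, _Value} = Row <- ViewRows,
                sets:is_element(Id, Set)].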

Cheers,
--
Christopher Lenz
   cmlenz at gmx.de
   http://www.cmlenz.net/


Re: CouchDB 1.0 work

Posted by David Zülke <dz...@bitxtender.com>.
Another thing: reliable versioning (we talked about this on IRC).

Basically, at the moment, old revisions might get lost during  
compaction or whatever.
You mentioned you were planning a relatively trivial change where,
on an update, the old document becomes an attachment, so you'd
essentially have a list of old document revisions.

Just thought I'd bring that up ;)

Stay strong, great work so far,

David


P.S.: any plans on a timeline for 1.0? 2008? 2009? :)



On 28.04.2008, at 18:27, Damien Katz wrote:

> Here are my thoughts on what we need before we can get to
> CouchDB 1.0. Feedback please.
> [...]
>


Re: CouchDB 1.0 work

Posted by Ted Leung <tw...@sauria.com>.
Ok great - I missed that part of the thread I guess.

On Apr 30, 2008, at 1:19 PM, Jan Lehnardt wrote:

> Heya Ted,
> we definitely want to do a 0.8.0 before going 1.0.
> See http://mail-archives.apache.org/mod_mbox/incubator-couchdb-dev/200804.mbox/%3c3651E215-81DC-44A1-93A2-8DF6BD9378A1@gmail.com%3e
> ff. for details.
>
> Summary: Wait for cmlenz to get back home :)
>
> Cheers
> Jan
> --
>
> On Apr 30, 2008, at 22:11, Ted Leung wrote:
>> What about trying to make a 0.8 release from the ASF repository?   
>> Or would you rather do this starting at 1.0?
>>
>> Ted
>>
>> On Apr 28, 2008, at 9:27 AM, Damien Katz wrote:
>>
>>> Here are my thoughts on what we need before we can get to
>>> CouchDB 1.0. Feedback please.
>>> [...]


Re: CouchDB 1.0 work

Posted by Jan Lehnardt <ja...@apache.org>.
Heya Ted,
we definitely want to do a 0.8.0 before going 1.0.
See http://mail-archives.apache.org/mod_mbox/incubator-couchdb-dev/200804.mbox/%3c3651E215-81DC-44A1-93A2-8DF6BD9378A1@gmail.com%3e
ff. for details.

Summary: Wait for cmlenz to get back home :)

Cheers
Jan
--

On Apr 30, 2008, at 22:11, Ted Leung wrote:
> What about trying to make a 0.8 release from the ASF repository?  Or  
> would you rather do this starting at 1.0?
>
> Ted
>
> On Apr 28, 2008, at 9:27 AM, Damien Katz wrote:
>
>> Here are my thoughts on what we need before we can get to
>> CouchDB 1.0. Feedback please.
>> [...]


Re: CouchDB 1.0 work

Posted by Ted Leung <tw...@sauria.com>.
What about trying to make a 0.8 release from the ASF repository?  Or  
would you rather do this starting at 1.0?

Ted

On Apr 28, 2008, at 9:27 AM, Damien Katz wrote:

> Here are my thoughts on what we need before we can get to
> CouchDB 1.0. Feedback please.
> [...]


Re: CouchDB 1.0 work

Posted by Benoit Chesneau <bc...@gmail.com>.
On Sat, May 10, 2008 at 7:36 PM, Christopher Lenz <cm...@gmx.de> wrote:
> [...]
> I made a test with Apache/mod_proxy with Digest auth, and it does seem to
> pass through the auth credentials (username, realm, etc) via the
> Authorization header. So this should hopefully work in general, sorry for
> the noise :P

the same with nginx or squid :)

-- 
- benoît

Re: CouchDB 1.0 work

Posted by Christopher Lenz <cm...@gmx.de>.
On 10.05.2008, at 17:53, Damien Katz wrote:
> On May 10, 2008, at 11:35 AM, Christopher Lenz wrote:
>> As far as I know, the proxy will keep the auth info to itself, and  
>> the request will look like a standard anonymous request to CouchDB.  
>> I *think* if we don't implement authentication, we can not  
>> implement authorization/security for document validation.
>
> Well, I don't know the details of authenticating proxies, but if the
> user provides credentials in the HTTP header, and the proxy server
> validates them and passes them on, then CouchDB would just use the
> same credentials with the assumption that they are authenticated,
> because the HTTP server validated them. But maybe this isn't
> possible for reasons I don't know about.

I made a test with Apache/mod_proxy with Digest auth, and it does seem  
to pass through the auth credentials (username, realm, etc) via the  
Authorization header. So this should hopefully work in general, sorry  
for the noise :P

Cheers,
--
Christopher Lenz
   cmlenz at gmx.de
   http://www.cmlenz.net/


Re: CouchDB 1.0 work

Posted by Damien Katz <da...@gmail.com>.
On May 10, 2008, at 11:35 AM, Christopher Lenz wrote:

> [...]
>
> Yeah, but how will CouchDB be able to use the authentication results
> to provide the "Security/Document validation" feature?
>
> As far as I know, the proxy will keep the auth info to itself, and  
> the request will look like a standard anonymous request to CouchDB.  
> I *think* if we don't implement authentication, we can not implement  
> authorization/security for document validation.

Well, I don't know the details of authenticating proxies, but if the
user provides credentials in the HTTP header, and the proxy server
validates them and passes them on, then CouchDB would just use the
same credentials with the assumption that they are authenticated,
because the HTTP server validated them. But maybe this isn't possible
for reasons I don't know about.

-Damien


Re: CouchDB 1.0 work

Posted by Christopher Lenz <cm...@gmx.de>.
On 10.05.2008, at 16:47, Damien Katz wrote:
> [...]
>
> I'm thinking the proxy server will authenticate the user's
> credentials in the request HTTP header, then let the request pass
> normally to the CouchDB server. If it can't authenticate, then it
> rejects the request.

Yeah, but how will CouchDB be able to use the authentication results  
to provide the "Security/Document validation" feature?

As far as I know, the proxy will keep the auth info to itself, and the  
request will look like a standard anonymous request to CouchDB. I  
*think* if we don't implement authentication, we can not implement  
authorization/security for document validation.

Cheers,
--
Christopher Lenz
   cmlenz at gmx.de
   http://www.cmlenz.net/


Re: CouchDB 1.0 work

Posted by Damien Katz <da...@gmail.com>.
On May 10, 2008, at 10:09 AM, Christopher Lenz wrote:

> On 28.04.2008, at 18:27, Damien Katz wrote:
>> Here are my thoughts on what we need before we can get to
>> CouchDB 1.0. Feedback please.
>>
>> Must have:
> [...]
>> Security/Document validation: We need a way to control who can  
>> update what documents and to validate the updates are correct. This  
>> is absolutely necessary for offline replication, where replicated  
>> updates to the database do not come through the application layer.
> [...]
>> Don't Need:
>>
>> Authentication. We can go to 1.0 without authentication, relying  
>> instead on local proxies to provide authentication.
>
> So how would we provide authorization without authentication? There  
> needs to be some way to identify who's making a request, and if we  
> plan to rely on proxies for that, those proxies need to provide a  
> way to pass on the authentication results (e.g. REMOTE_USER). I  
> suspect they don't do that, but I may be wrong.

I'm thinking the proxy server will authenticate the user's credentials
in the request HTTP header, then let the request pass normally to the
CouchDB server. If it can't authenticate, then it rejects the request.
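
A sketch of what the trusting side could look like (all names invented
here, and real Digest header parsing is more careful than this):

    -module(couch_proxy_auth_sketch).
    -export([username_from_header/1]).

    %% Pull the username out of an Authorization header that the
    %% front-end proxy has already validated.
    username_from_header("Digest " ++ Params) ->
        Fields = [string:strip(F) || F <- string:tokens(Params, ",")],
        case [V || F <- Fields, {"username", V} <- [split_pair(F)]] of
            [User | _] -> {ok, User};
            []         -> anonymous
        end;
    username_from_header(_Other) ->
        anonymous.

    split_pair(Field) ->
        case string:tokens(Field, "=") of
            [K, V] -> {K, string:strip(V, both, $")};
            _      -> {Field, ""}
        end.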

-Damien

Re: CouchDB 1.0 work

Posted by Christopher Lenz <cm...@gmx.de>.
On 28.04.2008, at 18:27, Damien Katz wrote:
> Here are my thoughts on what we need before we can get to
> CouchDB 1.0. Feedback please.
>
> Must have:
[...]
> Security/Document validation: We need a way to control who can  
> update what documents and to validate the updates are correct. This  
> is absolutely necessary for offline replication, where replicated  
> updates to the database do not come through the application layer.
[...]
> Don't Need:
>
> Authentication. We can go to 1.0 without authentication, relying  
> instead on local proxies to provide authentication.

So how would we provide authorization without authentication? There  
needs to be some way to identify who's making a request, and if we  
plan to rely on proxies for that, those proxies need to provide a way  
to pass on the authentication results (e.g. REMOTE_USER). I suspect  
they don't do that, but I may be wrong.

Cheers,
--
Christopher Lenz
   cmlenz at gmx.de
   http://www.cmlenz.net/


Re: CouchDB 1.0 work

Posted by Noah Slater <ns...@apache.org>.
On Mon, Apr 28, 2008 at 12:27:40PM -0400, Damien Katz wrote:
> Testing: We need lots more tests. By the time we ship 1.0, we should
> have far more test suite code than production code. And we need to do
> load testing. Can the current browser-based test suite scale to this
> kind of heavy testing?

Agreed. I want to move this into the build if possible, but that requires a way
to get XHR from the console, so I'm not sure how best to approach that.

As for the rest, in total agreement.

-- 
Noah Slater - The Apache Software Foundation <http://www.apache.org/>

Re: CouchDB 1.0 work

Posted by Jan Lehnardt <ja...@apache.org>.
Additional thoughts:

Must have:
Refactoring of attachment API as per earlier discussions and
proposal by Christopher.

Nice to have:
In the course of speed and load tests, and ultimately when
running a live system, it would be nice to have more
introspection: basically counters and stats to evaluate the
state of a CouchDB node.

Cheers
Jan
--

On Apr 28, 2008, at 18:27, Damien Katz wrote:
> Here are my thoughts on what we need before we can get to
> CouchDB 1.0. Feedback please.
> [...]


Re: CouchDB 1.0 work

Posted by Jan Lehnardt <ja...@apache.org>.
Heya Damien,
On Apr 28, 2008, at 18:27, Damien Katz wrote:
> Here are my thoughts on what we need before we can get to
> CouchDB 1.0. Feedback please.
>
> Must have:
>
> Incremental reduce: Maybe the single biggest outstanding work item.
> Probably two weeks of development to get to a testable state.

I obviously agree here, but I guess you are the only one who can
make any sensible estimates, and you're doing all the coding, which
might not be the best thing for the incubating project. Would it be
possible for you to 'take in an apprentice' whom you tug along
while doing that work and bring up to speed with that part
of the code? This will delay things, and it might be impractical
(after all, who should the apprentice be? :) or even a stupid idea,
but it might make sense to get more people into the code.


> Security/Document validation: We need a way to control who can  
> update what documents and to validate the updates are correct. This  
> is absolutely necessary for offline replication, where replicated  
> updates to the database do not come through the application layer.

Do you have any more ideas on how the notion of 'who'
should be defined here? Is that an HTTP-Auth user,
something on the CouchDB level, or something entirely
different?

Also, a feature request for the validation function is
to allow modifying the document before saving. It'd
be nice to have, and we should keep that in mind while
designing this feature.

> View index compaction/management: View indexes currently just grow,  
> need a compaction similar to storage compaction. Also, there is no  
> way to purge old unused indexes, except via the OS.

My comments about the reduce feature apply here equally.


> File sync problem: file:sync(), a call that flushes all uncommitted
> writes to disk before returning, doesn't work fully, or at all, on
> some platforms (usually we just lack the flags to tell the OS to
> write to disk). Should be fixable by either patching the existing
> Erlang driver source, or using a replacement file driver.

Fixing Erlang sounds like the most solid solution here. I did try
to push the patch we had for inets upstream, but my mails never
reached their mailing list and I couldn't be bothered to investigate
because we've switched away from inets. Anyway: we'd need
someone to actively evangelize the patch with the Erlang
maintainers. This person should be aware of all the implications
the patch introduces. From what I gathered, they generally accept
sensible patches; it just might take some time, and the less
intrusive (not breaking anything existing) the patch is, the more
likely it is to be accepted.


> Optimizations: Right now HTTP overhead is huge, with HTTP latency/
> overhead at about 80% of our document read time when loaded from a
> local client (same machine). Once we can get this down to below 50%,
> we'll focus on optimizing the database and other components. Most
> core database operations (document reads, updates, and view indexing)
> are completely unoptimized so far, with update speed being the
> biggest complaint.

Jumping past HTTP optimisation:
You mentioned a caching layer based on Erlang's
Judy-tree implementation (is that (d)ets, btw?) at
some point. I assume that would speed up everything
that involves disk reads (including updates, which need
to know at least the latest revision of the doc that is
to be updated).

From what I gathered with my config patch, writing
a key-value storage module is trivial in Erlang; would a
caching system work in the same way?
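
For illustration, I'd picture something as simple as this plain ets
sketch (table name and API invented, and not the Judy-based thing you
mentioned):

    -module(couch_cache_sketch).
    -export([init_cache/0, cache_get/1, cache_put/2]).

    %% A named public ets table, keyed by e.g. {DbName, DocId}.
    init_cache() ->
        ets:new(doc_cache, [set, public, named_table]).

    cache_get(Key) ->
        case ets:lookup(doc_cache, Key) of
            [{Key, Doc}] -> {ok, Doc};
            []           -> miss
        end.

    cache_put(Key, Doc) ->
        true = ets:insert(doc_cache, {Key, Doc}),
        ok.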


> Testing: We need lots more tests. By the time we ship 1.0, we should  
> have far more test suite code than production code.

This is tedious, of course. Maybe we can get all the devs
and everybody who wants to help out together to go on a testing
spree, adding test cases and discussing only related issues
for a couple of days, to get the bulk of the work done here?

Maybe with nice goals we can be proud of reaching afterwards,
and all the usual motivation-crap :)


> And we need to do load testing. Can the current browser-based test
> suite scale to this kind of heavy testing?

I doubt (please prove me wrong) that a browser could create
enough load for a single node on reasonably current hardware.

For load testing we obviously need at least two machines,
one doing the testing and one being tested, better three,
with another one for logging. Testing replication needs
even more machines. And ideally we have even a couple
of client machines to generate the load.

That said, the Ajax of the test suite only works if it is
served from the same host as CouchDB. You can circumvent
that using a proxy but that makes benchmarking harder.

I had some luck testing CouchDB using Tsung
(http://tsung.erlang-projects.org/), which is a bit daunting at
first but maybe the best tool to bench and profile Erlang server
applications. It can use a pool of client machines to bench a
single server.

And Erlang comes with built-in profiling and code
coverage tools that we might want to look at for
finding code hot-spots.
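
For example, fprof can profile a single call; couch_db:open_doc/2 is
just my stand-in for whatever call we want to measure, with Db and
DocId assumed to be in scope:

    %% Trace one call, crunch the trace file, print the analysis.
    fprof:apply(couch_db, open_doc, [Db, DocId]),
    fprof:profile(),
    fprof:analyse().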

In conclusion, testing needs a lot of iron and time and maybe
some vendor can step in here and give us access to a testing
lab. Maybe we can get help from the ASF infrastructure?


> Nice to have:
>
> Plug-ins: An Erlang module plug-in architecture, to make adding new
> server side code easy. Right now the code that maps special urls  
> (_view, _compact, _search, etc) to the appropriate Erlang call is  
> messy and convoluted, and getting worse as we go. We need a standard  
> way to map the special urls to the appropriate Erlang call.
>
> Tail-committed database headers: To optimize the updating of the
> database by reducing the number and length of seeks required, the
> file header should be written to the end of the file, rather than
> the beginning. Depending on the platform this can remove a full head
> seek, and in the best case a document insert/update can require zero
> head seeks (if the head is already positioned at the end of the
> file). But this can slow file opening, as it may need to search the
> file for the most recent valid header. In the event of a crash, the
> header scan/search cost at database open can be
> linear or logarithmic, depending on the exact implementation.

Maybe this could be a per-database option?


> Clustering: The ability to cluster CouchDB servers, to increase both  
> reliability (failover-clustering) and client scalability (more  
> servers to handle more concurrent user load). Clustering does not
> increase data scalability (that's partitioning/sharding).

Some zeroconf-based auto-discovery and auto-config of new nodes would  
be totally kick-ass :)


> Selective document purging/compaction: Deletion stubs are kept  
> around for replication purposes. Need a way to purge the records of
> documents that are old or deleted.

> Revision path pruning: Each document keeps a list of all
> previous revisions. We need a way to prune the oldest records of
> document revisions and re-merge pruned lists during replication.
>
> Don't Need:
>
> Authentication. We can go to 1.0 without authentication, relying  
> instead on local proxies to provide authentication.

+1

> Partitioning: This is a big project with lots of
> considerations. It's best to move it post-1.0.

+1


For the must-have list: is my config patch considered accepted once I
get it into a shape that addresses all concerns? Or should that be
added to the list?


Cheers
Jan
--