Posted to dev@couchdb.apache.org by Randall Leeds <ra...@gmail.com> on 2009/03/30 02:59:33 UTC

CouchDB Cluster/Partition GSoC

To start, I'd like to introduce myself, as I've been contributing off and on
in small ways to dev list activity and a little IRC chatter but haven't been
very visible in the community.

My name is Randall and I'm a student at Brown University in Providence,
Rhode Island, USA. I've got one more semester ahead of me in my
undergraduate degree. I've been working with CouchDB on the Melkjug[1]
project since June and have been intermittently active with
couchdb-python[2] as a committer fixing small bugs.

I'd like to create and polish a proposal this week for submission as a
Google Summer of Code Project.

To that end, this thread is to start the drafting process and determine a
prioritized list of tasks and inter-task dependencies required to support a
smooth clustering and partitioning experience in CouchDB.

Skip the next section if you just want to read my questions and jump right
into the discussion.

Otherwise, here's a brief overview of background information:

A clarification of terms:

On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz <da...@apache.org> wrote:
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same
> data (they might be behind on replicating changes, but otherwise they have
> the same data).
source:
http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3Ce282921e0902191754v4b29f8den98f083481767b7bf@mail.gmail.com%3E

Existing partitioning proposal[3] on the wiki:
http://wiki.apache.org/couchdb/Partitioning_proposal

From an e-mail exchange between Chris A and me on first steps:

Chris wrote:
>I think as far as writing goes, there's still more work to be done on
>design, but there are some pieces that can be written first:
>
> * consistent hashing Erlang proxy (start out with HTTP)
> * view merging across partitioned nodes
>
>These two can be run as their own software at first, so they can sit
>in front of a cluster of CouchDB machines without any changes
>happening to CouchDB. Once they work, they can be tied to CouchDB
>using Erlang terms and IPC instead of JSON/HTTP.
>
>There are some design questions about a partitioned CouchDB that we
>should probably take up on the list:
>
> * what about _all_docs and other node-global queries?
> * does a cluster use a single seq-num or does each node have its own?

Finally, the great folks from Meebo have recently posted couchdb-lounge[4]
which uses an nginx proxy to make a "partitioning/clustering framework for
CouchDB".

The questions we should address are prioritized here:
1) What's required to make CouchDB a full OTP application? Isn't it using
gen_server already?
2) What about _all_docs and seq-num?
3) Can we agree on a proposed solution to the layout of partition nodes? I
like the tree solution, as long as it is extremely flexible wrt tree depth.
4) Should the consistent hashing algorithm map ids to leaf nodes or just to
children? I lean toward children because it encapsulates knowledge about the
layout of subtrees at each tree level.

Submissions for GSoC are due by Friday so I would appreciate any help in
polishing a proposal that will best serve the needs of the CouchDB
community. Hopefully this generates some initial discussion that will lead
me to a draft proposal in the next couple of days which I will post for
revision and comment until I submit it at the end of the week.

Thanks in advance,
Randall

[1] http://www.openplans.org/projects/melkjug/project-home
[2] http://code.google.com/p/couchdb-python/
[3] http://wiki.apache.org/couchdb/Partitioning_proposal
[4] http://code.google.com/p/couchdb-lounge/

Re: CouchDB Cluster/Partition GSoC

Posted by Chris Anderson <jc...@gmail.com>.

Sent from my iPhone

On Apr 7, 2009, at 10:46 PM, Randall Leeds <ra...@gmail.com>  
wrote:

> On Wed, Apr 8, 2009 at 01:41, Randall Leeds  
> <ra...@gmail.com> wrote:
>
>> Thanks for the suggestions, Chris.
>> Link is still here:
>> http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster
>>
>> I can't seem to access the edit page for the official proposal  
>> submission
>> right now. I get an error.
>> However, I've done some updates. At this point, I'm hoping that you  
>> or
>> Damien might consider picking this up and decide to endorse it and  
>> become a
>> mentor. Then it's up to the foundation and Google!
>
>
> I suppose if you do decide to, a link to the proposal should  
> probably go
> here:
> http://wiki.apache.org/general/SummerOfCode2009#couchdb-project
> Proposal URL:
> http://socghop.appspot.com/student_proposal/show/google/gsoc2009/rleeds/t123878289629
>
> It's a shame I can't seem to edit the proposal right now, so maybe a  
> link to
> the document version since it's more up-to-date?
>
>

You should be able to edit the wiki at least.

I'd be happy to mentor your project. I can certainly help you work in  
the Apache way, and maybe I can help a little with the technology.

>>
>>
>> Either way, I want to be involved in this work :)
>>

Being unstoppable with the patches is the most important thing.  
Looking forward to it.

>> Cheers,
>> Randall
>>
>>
>> On Fri, Apr 3, 2009 at 16:30, Chris Anderson <jc...@apache.org>  
>> wrote:
>>
>>>
>>> From the proposal:
>>>> 2. Fast http proxy written in Erlang which leverages the  
>>>> consistent hash
>>> for determining destinations
>>>
>>> You might find it simpler to use Erlang messaging instead of http in
>>> the proxy layer. I'm not certain about this but it might end up
>>> simpler and faster in the long run. There are arguments in favor of
>>> http, so I'd say the choice is yours, but keep in mind someone will
>>> eventually attempt the other way, no matter which you choose.
>>
>>
>> Yeah, this is what I had in mind after we talked and I wrote this  
>> wrong.
>>
>>
>>>
>>>> August 10 - Submit patches for review, discussion and polishing
>>>
>>> I think it would make for a smoother process if you attempt to
>>> integrate as you go. It'll mean identifying the smallest useful  
>>> chunks
>>> of work, to get us from here to there, but it's also the open-source
>>> way, and I think it results in better code. Nothing like having what
>>> you're working on being used in real applications.
>>>
>>> Can you identify the very first step? - maybe it's an integration  
>>> test
>>> in JavaScript that proves that three dbs (on one host) can have
>>> document ids partitioned correctly. (I think a core thing here is
>>> getting the right validation functions on the right db's, so they
>>> reject bad PUTs)
>>
>>
>> I finally got around to a crack at adding some JS test examples.
>> I'd like to add some examples about querying partition setup, etc,  
>> but then
>> again, that might just be in the _design doc. There are so many  
>> questions
>> unsettled still that I feel like what I added is probably enough to  
>> get a
>> feel for it.
>>
>>

Re: CouchDB Cluster/Partition GSoC

Posted by Randall Leeds <ra...@gmail.com>.
On Wed, Apr 8, 2009 at 01:41, Randall Leeds <ra...@gmail.com> wrote:

> Thanks for the suggestions, Chris.
> Link is still here:
> http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster
>
> I can't seem to access the edit page for the official proposal submission
> right now. I get an error.
> However, I've done some updates. At this point, I'm hoping that you or
> Damien might consider picking this up and decide to endorse it and become a
> mentor. Then it's up to the foundation and Google!


I suppose if you do decide to, a link to the proposal should probably go
here:
http://wiki.apache.org/general/SummerOfCode2009#couchdb-project
Proposal URL:
http://socghop.appspot.com/student_proposal/show/google/gsoc2009/rleeds/t123878289629

It's a shame I can't seem to edit the proposal right now, so maybe a link to
the document version since it's more up-to-date?


>
>
> Either way, I want to be involved in this work :)
>
> Cheers,
> Randall
>
>
> On Fri, Apr 3, 2009 at 16:30, Chris Anderson <jc...@apache.org> wrote:
>
>>
>> From the proposal:
>> > 2. Fast http proxy written in Erlang which leverages the consistent hash
>> for determining destinations
>>
>> You might find it simpler to use Erlang messaging instead of http in
>> the proxy layer. I'm not certain about this but it might end up
>> simpler and faster in the long run. There are arguments in favor of
>> http, so I'd say the choice is yours, but keep in mind someone will
>> eventually attempt the other way, no matter which you choose.
>
>
> Yeah, this is what I had in mind after we talked and I wrote this wrong.
>
>
>>
>> > August 10 - Submit patches for review, discussion and polishing
>>
>> I think it would make for a smoother process if you attempt to
>> integrate as you go. It'll mean identifying the smallest useful chunks
>> of work, to get us from here to there, but it's also the open-source
>> way, and I think it results in better code. Nothing like having what
>> you're working on being used in real applications.
>>
>> Can you identify the very first step? - maybe it's an integration test
>> in JavaScript that proves that three dbs (on one host) can have
>> document ids partitioned correctly. (I think a core thing here is
>> getting the right validation functions on the right db's, so they
>> reject bad PUTs)
>
>
> I finally got around to a crack at adding some JS test examples.
> I'd like to add some examples about querying partition setup, etc, but then
> again, that might just be in the _design doc. There are so many questions
> unsettled still that I feel like what I added is probably enough to get a
> feel for it.
>
>

Re: CouchDB Cluster/Partition GSoC

Posted by Randall Leeds <ra...@gmail.com>.
Thanks for the suggestions, Chris.
Link is still here:
http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster

I can't seem to access the edit page for the official proposal submission
right now. I get an error.
However, I've done some updates. At this point, I'm hoping that you or
Damien might consider picking this up and decide to endorse it and become a
mentor. Then it's up to the foundation and Google!

Either way, I want to be involved in this work :)

Cheers,
Randall


On Fri, Apr 3, 2009 at 16:30, Chris Anderson <jc...@apache.org> wrote:

>
> From the proposal:
> > 2. Fast http proxy written in Erlang which leverages the consistent hash
> for determining destinations
>
> You might find it simpler to use Erlang messaging instead of http in
> the proxy layer. I'm not certain about this but it might end up
> simpler and faster in the long run. There are arguments in favor of
> http, so I'd say the choice is yours, but keep in mind someone will
> eventually attempt the other way, no matter which you choose.


Yeah, this is what I had in mind after we talked and I wrote this wrong.


>
> > August 10 - Submit patches for review, discussion and polishing
>
> I think it would make for a smoother process if you attempt to
> integrate as you go. It'll mean identifying the smallest useful chunks
> of work, to get us from here to there, but it's also the open-source
> way, and I think it results in better code. Nothing like having what
> you're working on being used in real applications.
>
> Can you identify the very first step? - maybe it's an integration test
> in JavaScript that proves that three dbs (on one host) can have
> document ids partitioned correctly. (I think a core thing here is
> getting the right validation functions on the right db's, so they
> reject bad PUTs)


I finally got around to a crack at adding some JS test examples.
I'd like to add some examples about querying partition setup, etc, but then
again, that might just be in the _design doc. There are so many questions
unsettled still that I feel like what I added is probably enough to get a
feel for it.

Re: CouchDB Cluster/Partition GSoC

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Apr 3, 2009 at 11:42 AM, Randall Leeds <ra...@gmail.com> wrote:
> Just wanted to make it available to everyone to see now, in case you're
> curious, even though review/comment time is past.
>
> I meant to get this out earlier to allow some review time, but I just got
> buried too deep in academics this week.
>
> The proposal has been officially submitted, but you need a GSoC site account
> to view it, so here's the public document that contains the same info:
> http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster
>
> For those with an account, the submitted proposal is here:
> http://socghop.appspot.com/student_proposal/show/google/gsoc2009/rleeds/t123878289629
>
> I'm not terribly concerned if the details aren't all there, as long as the
> proposal itself is convincing enough. The plan includes much discussion and
> hashing out of the details... i.e. more of the discussion in this thread
> along with some profiling and planning before execution of the nitty gritty
> stuff happens.
>
> Thanks for all the advice and help

From the proposal:
> 2. Fast http proxy written in Erlang which leverages the consistent hash for determining destinations

You might find it simpler to use Erlang messaging instead of http in
the proxy layer. I'm not certain about this but it might end up
simpler and faster in the long run. There are arguments in favor of
http, so I'd say the choice is yours, but keep in mind someone will
eventually attempt the other way, no matter which you choose.
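
To make the difference concrete, the only piece that really changes is the
forwarding step -- roughly something like the sketch below (module and
process names are made up, and ibrowse just stands in for any HTTP client):

-module(forward_sketch).
-export([forward_http/5, forward_erlang/4]).

%% Option 1: plain HTTP. Re-issue the request against the node chosen by
%% the consistent hash.
forward_http(NodeUrl, Path, Headers, Method, Body) ->
    ibrowse:send_req(NodeUrl ++ Path, Headers, Method, Body).

%% Option 2: distributed Erlang. Send plain terms to a registered process
%% on the target node -- no JSON round-trip on the inside.
%% couch_partition_server is a made-up name for whatever handles this.
forward_erlang(Node, DbName, DocId, DocBody) ->
    gen_server:call({couch_partition_server, Node},
                    {update_doc, DbName, DocId, DocBody}).

Either way the consistent hash lookup in front is identical, which is why
starting with HTTP and swapping the transport later should be painless.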

> August 10 - Submit patches for review, discussion and polishing

I think it would make for a smoother process if you attempt to
integrate as you go. It'll mean identifying the smallest useful chunks
of work, to get us from here to there, but it's also the open-source
way, and I think it results in better code. Nothing like having what
you're working on being used in real applications.

Can you identify the very first step? - maybe it's an integration test
in JavaScript that proves that three dbs (on one host) can have
document ids partitioned correctly. (I think a core thing here is
getting the right validation functions on the right db's, so they
reject bad PUTs)

Hope that helps,
Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: CouchDB Cluster/Partition GSoC

Posted by Randall Leeds <ra...@gmail.com>.
Just wanted to make it available to everyone to see now, in case you're
curious, even though review/comment time is past.

I meant to get this out earlier to allow some review time, but I just got
buried too deep in academics this week.

The proposal has been officially submitted, but you need a GSoC site account
to view it, so here's the public document that contains the same info:
http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster

For those with an account, the submitted proposal is here:
http://socghop.appspot.com/student_proposal/show/google/gsoc2009/rleeds/t123878289629

I'm not terribly concerned if the details aren't all there, as long as the
proposal itself is convincing enough. The plan includes much discussion and
hashing out of the details... i.e. more of the discussion in this thread
along with some profiling and planning before execution of the nitty gritty
stuff happens.

Thanks for all the advice and help
-Randall

Re: CouchDB Cluster/Partition GSoC

Posted by Brad Anderson <br...@sankatygroup.com>.
On Apr 1, 2009, at 11:58 AM, Chris Anderson wrote:

> On Wed, Apr 1, 2009 at 8:37 AM, Adam Kocoloski <ko...@apache.org>  
> wrote:
>> On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:
>>
>>>>>  2) What about _all_docs and seq-num?
>>>>>
>>>>> I presume _all_docs gets merged like any other view.   
>>>>> _all_docs_by_seq
>>>>> is a
>>>>> different story.  In the current code the sequence number is  
>>>>> incremented
>>>>> by
>>>>> one for every update.  If we want to preserve that behavior in
>>>>> partitioned
>>>>> databases we need some sort of consensus algorithm or master  
>>>>> server to
>>>>> choose the next sequence number.  It could easily turn into a  
>>>>> bottleneck
>>>>> or
>>>>> single point-of-failure if we're not careful.
>>>>>
>>>>> The alternatives are to a) abandon the current format for update
>>>>> sequences
>>>>> in favor of vector clocks or something more opaque, or b) have
>>>>> _all_docs_by_seq be strictly a node-local query.  I'd prefer the  
>>>>> former,
>>>>> as
>>>>> I think it will be useful for e.g. external indexers to treat the
>>>>> partitioned database just like a single server one.  If we do the
>>>>> latter, I
>>>>> think it means that either the external indexers have to be  
>>>>> installed on
>>>>> every node, or at least they have to be aware of all the  
>>>>> partitions.
>>>>
>>>>
>>>> If at all possible I think we should have the entire partition  
>>>> group
>>>> appear
>>>> as a single server from the outside. One thing that comes to mind  
>>>> here is
>>>> a
>>>> question about sequence numbers. Vector clocks only guarantee a  
>>>> partial
>>>> ordering, but I'm under the impression that currently seq numbers  
>>>> have a
>>>> strict ordering.
>>>>
>>>> Database sequence numbers are used in replication and in  
>>>> determining
>>>> whether
>>>> views need refreshing. Anything else I'm missing? Currently it  
>>>> seems
>>>> there
>>>> is no tracking of which updates actually change a view index  
>>>> (comment on
>>>> line 588 of couch_httpd_view.erl on trunk). Improving this would  
>>>> be a
>>>> nice
>>>> win. See my answer to number (3).
>>>>
>>>> The easy way to manage seq numbers is to let one node be the  
>>>> write master
>>>> for any cluster. (The root node of any partition group could  
>>>> actually be
>>>> a
>>>> cluster, but if writes always go through a master the master can  
>>>> maintain
>>>> the global sequence number for the partition group).
>>>
>>> The problem with this approach is that the main use-case for
>>> partitioning is when your incoming writes exceed the capacity of a
>>> single node. By partitioning the key-space, you can get more
>>> write-throughput.
>>
>> I think Randall was saying requests just have to originate at the  
>> master
>> node.  That master node could do nothing more than assign a  
>> sequence number,
>> choose a node, and proxy the request down the tree for the heavy  
>> lifting.  I
>> bet we could get pretty good throughput, but I still worry about this
>> approach for availability reasons.
>
> Yes, I agree. I think vector clocks are a good compromise. I hadn't
> considered that since index updates are idempotent, we can allow a
> little slop in the global clock. This makes everything much easier.
>
>>
>>> I'm not sure that an update-seq per node is such a bad thing, as it
>>> will require any external indexers to be deployed in a 1-to-1
>>> relationship to nodes, which automatically balances the load for the
>>> indexer as well. With a merged seq-id, users would be encouraged to
>>> partition CouchDB without bothering to partition indexers. Maybe  
>>> this
>>> is acceptable in some cases, but not in the general case.
>>
>> So, the vector clock approach still has a per-node update sequence  
>> for each
>> node's local clock, it just does the best job possible of globally  
>> ordering
>> those per-node sequences.  We could easily offer local update  
>> sequences as
>> well via some query string parameter.  I understand the desire to  
>> encourage
>> partitioned indexers, but I believe that won't always be possible.   
>> Bottom
>> line, I think we should support global indexing of a partitioned DB.
>>
>
> I think you're right. As long as partitioned indexers are possible, I
> have nothing against making global indexers as well.
>
>>
>> I'd like to hear more about how we implement redundancy and handle  
>> node
>> failures in the tree structure.  In a pure consistent hashing ring,  
>> whether
>> globally connected (Dynamo) or not (Chord), there are clear  
>> procedures for
>> dealing with node failures, usually involving storing copies of the  
>> data at
>> adjacent nodes along the ring.  Do we have an analogue of that in  
>> the tree?
>>  I'm especially worried about what happens when inner nodes go down.
>>
>
> I like to think of partitioning and redundancy as orthogonal. If each
> node has a hot-failover "twin", then parent nodes can track for all of
> their children, the children's twins as well, and swap them out in
> case of unavailability.
>
> I'm not so hot on the Chord / Dynamo style of storing parts of
> partitions on other partitions. Even just saying that is confusing.
>
> Because physical nodes need not map directly to logical nodes, we just
> need to be sure that each node's twin lives on a different physical
> node (which it can share with other logical nodes).
>
> The end result is that we can have N duplicates of the entire tree,
> and even load-balance across them. It'd be a small optimization to
> allow you to read from both twins and write to just one.
>
> Chris

So I have not poked around CouchDB for a while, but recently began to  
monitor the MLs again, so forgive me if this has been hashed out  
already...  Do we have to choose one way or another on the issues  
discussed in this thread?  Or is it a better 'core-couch' design  
decision to make these things pluggable, a la emacs, eclipse, trac?

i.e. if you want chord / peer-to-peer storage, use that plugin, or if
you want vector clocks, use that plugin.  Or different indexers /
strategies, use the appropriate plugin.  Core couch need only provide
well-reasoned stubs or extension points for these plugins.  Or does this
decouple existing functionality and design goals too much?  I could
definitely see a way in this design to get to a custom erlang term
storage engine, no json impedance mismatch, and a native erlang view
engine.

Cheers,
BA


Re: CouchDB Cluster/Partition GSoC

Posted by Chris Anderson <jc...@apache.org>.
On Wed, Apr 1, 2009 at 8:37 AM, Adam Kocoloski <ko...@apache.org> wrote:
> On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:
>
>>>>  2) What about _all_docs and seq-num?
>>>>
>>>> I presume _all_docs gets merged like any other view.  _all_docs_by_seq
>>>> is a
>>>> different story.  In the current code the sequence number is incremented
>>>> by
>>>> one for every update.  If we want to preserve that behavior in
>>>> partitioned
>>>> databases we need some sort of consensus algorithm or master server to
>>>> choose the next sequence number.  It could easily turn into a bottleneck
>>>> or
>>>> single point-of-failure if we're not careful.
>>>>
>>>> The alternatives are to a) abandon the current format for update
>>>> sequences
>>>> in favor of vector clocks or something more opaque, or b) have
>>>> _all_docs_by_seq be strictly a node-local query.  I'd prefer the former,
>>>> as
>>>> I think it will be useful for e.g. external indexers to treat the
>>>> partitioned database just like a single server one.  If we do the
>>>> latter, I
>>>> think it means that either the external indexers have to be installed on
>>>> every node, or at least they have to be aware of all the partitions.
>>>
>>>
>>> If at all possible I think we should have the entire partition group
>>> appear
>>> as a single server from the outside. One thing that comes to mind here is
>>> a
>>> question about sequence numbers. Vector clocks only guarantee a partial
>>> ordering, but I'm under the impression that currently seq numbers have a
>>> strict ordering.
>>>
>>> Database sequence numbers are used in replication and in determining
>>> whether
>>> views need refreshing. Anything else I'm missing? Currently it seems
>>> there
>>> is no tracking of which updates actually change a view index (comment on
>>> line 588 of couch_httpd_view.erl on trunk). Improving this would be a
>>> nice
>>> win. See my answer to number (3).
>>>
>>> The easy way to manage seq numbers is to let one node be the write master
>>> for any cluster. (The root node of any partition group could actually be
>>> a
>>> cluster, but if writes always go through a master the master can maintain
>>> the global sequence number for the partition group).
>>
>> The problem with this approach is that the main use-case for
>> partitioning is when your incoming writes exceed the capacity of a
>> single node. By partitioning the key-space, you can get more
>> write-throughput.
>
> I think Randall was saying requests just have to originate at the master
> node.  That master node could do nothing more than assign a sequence number,
> choose a node, and proxy the request down the tree for the heavy lifting.  I
> bet we could get pretty good throughput, but I still worry about this
> approach for availability reasons.

Yes, I agree. I think vector clocks are a good compromise. I hadn't
considered that since index updates are idempotent, we can allow a
little slop in the global clock. This makes everything much easier.

>
>> I'm not sure that an update-seq per node is such a bad thing, as it
>> will require any external indexers to be deployed in a 1-to-1
>> relationship to nodes, which automatically balances the load for the
>> indexer as well. With a merged seq-id, users would be encouraged to
>> partition CouchDB without bothering to partition indexers. Maybe this
>> is acceptable in some cases, but not in the general case.
>
> So, the vector clock approach still has a per-node update sequence for each
> node's local clock, it just does the best job possible of globally ordering
> those per-node sequences.  We could easily offer local update sequences as
> well via some query string parameter.  I understand the desire to encourage
> partitioned indexers, but I believe that won't always be possible.  Bottom
> line, I think we should support global indexing of a partitioned DB.
>

I think you're right. As long as partitioned indexers are possible, I
have nothing against making global indexers as well.

>
> I'd like to hear more about how we implement redundancy and handle node
> failures in the tree structure.  In a pure consistent hashing ring, whether
> globally connected (Dynamo) or not (Chord), there are clear procedures for
> dealing with node failures, usually involving storing copies of the data at
> adjacent nodes along the ring.  Do we have an analogue of that in the tree?
>  I'm especially worried about what happens when inner nodes go down.
>

I like to think of partitioning and redundancy as orthogonal. If each
node has a hot-failover "twin", then parent nodes can track for all of
their children, the children's twins as well, and swap them out in
case of unavailability.
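
Concretely it could be as simple as keeping a {Primary, Twin} pair per
child and pinging before forwarding; a rough sketch, with all names
made up:

-module(twin_route_sketch).
-export([pick/2]).

%% Children :: [{LogicalChildId, {PrimaryNode, TwinNode}}]
%% Returns the node to forward a request to, swapping in the twin when
%% the primary doesn't answer. Reads could just as well alternate
%% between the two.
pick(ChildId, Children) ->
    {Primary, Twin} = proplists:get_value(ChildId, Children),
    case net_adm:ping(Primary) of
        pong -> Primary;
        pang -> Twin
    end.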

I'm not so hot on the Chord / Dynamo style of storing parts of
partitions on other partitions. Even just saying that is confusing.

Because physical nodes need not map directly to logical nodes, we just
need to be sure that each node's twin lives on a different physical
node (which it can share with other logical nodes).

The end result is that we can have N duplicates of the entire tree,
and even load-balance across them. It'd be a small optimization to
allow you to read from both twins and write to just one.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: CouchDB Cluster/Partition GSoC

Posted by Adam Kocoloski <ko...@apache.org>.
On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:

>>>  2) What about _all_docs and seq-num?
>>>
>>> I presume _all_docs gets merged like any other view.   
>>> _all_docs_by_seq is a
>>> different story.  In the current code the sequence number is  
>>> incremented by
>>> one for every update.  If we want to preserve that behavior in  
>>> partitioned
>>> databases we need some sort of consensus algorithm or master  
>>> server to
>>> choose the next sequence number.  It could easily turn into a  
>>> bottleneck or
>>> single point-of-failure if we're not careful.
>>>
>>> The alternatives are to a) abandon the current format for update  
>>> sequences
>>> in favor of vector clocks or something more opaque, or b) have
>>> _all_docs_by_seq be strictly a node-local query.  I'd prefer the  
>>> former, as
>>> I think it will be useful for e.g. external indexers to treat the
>>> partitioned database just like a single server one.  If we do the  
>>> latter, I
>>> think it means that either the external indexers have to be  
>>> installed on
>>> every node, or at least they have to be aware of all the partitions.
>>
>>
>> If at all possible I think we should have the entire partition  
>> group appear
>> as a single server from the outside. One thing that comes to mind  
>> here is a
>> question about sequence numbers. Vector clocks only guarantee a  
>> partial
>> ordering, but I'm under the impression that currently seq numbers  
>> have a
>> strict ordering.
>>
>> Database sequence numbers are used in replication and in  
>> determining whether
>> views need refreshing. Anything else I'm missing? Currently it  
>> seems there
>> is no tracking of which updates actually change a view index  
>> (comment on
>> line 588 of couch_httpd_view.erl on trunk). Improving this would be  
>> a nice
>> win. See my answer to number (3).
>>
>> The easy way to manage seq numbers is to let one node be the write  
>> master
>> for any cluster. (The root node of any partition group could  
>> actually be a
>> cluster, but if writes always go through a master the master can  
>> maintain
>> the global sequence number for the partition group).
>
> The problem with this approach is that the main use-case for
> partitioning is when your incoming writes exceed the capacity of a
> single node. By partitioning the key-space, you can get more
> write-throughput.

I think Randall was saying requests just have to originate at the  
master node.  That master node could do nothing more than assign a  
sequence number, choose a node, and proxy the request down the tree  
for the heavy lifting.  I bet we could get pretty good throughput, but  
I still worry about this approach for availability reasons.

> I'm not sure that an update-seq per node is such a bad thing, as it
> will require any external indexers to be deployed in a 1-to-1
> relationship to nodes, which automatically balances the load for the
> indexer as well. With a merged seq-id, users would be encouraged to
> partition CouchDB without bothering to partition indexers. Maybe this
> is acceptable in some cases, but not in the general case.

So, the vector clock approach still has a per-node update sequence for  
each node's local clock, it just does the best job possible of  
globally ordering those per-node sequences.  We could easily offer  
local update sequences as well via some query string parameter.  I  
understand the desire to encourage partitioned indexers, but I believe  
that won't always be possible.  Bottom line, I think we should support  
global indexing of a partitioned DB.

>>> One other thing that bothers me is the merge-sort required for  
>>> every view
>>> lookup.  In *really* large clusters it won't be good if queries  
>>> for a single
>>> key in a view have to hit each partition.  We could have an  
>>> alternative
>>> structure where each view gets partitioned much like the document  
>>> data while
>>> it's built.  I worry that a view partitioned in this way may need  
>>> frequent
>>> rebalancing during the build, since view keys are probably not  
>>> going to be
>>> uniformly distributed.  Nevertheless, I think the benefit of  
>>> having many
>>> view queries only hit a small subset of nodes in the cluster is  
>>> pretty huge.
>>
>>
>> I agree that the merge-sort is something we need to look at  
>> carefully. We
>> should never hit a node in a view query unless it has data we need.  
>> We
>> certainly can't avoid merging altogether, but we can make an effort  
>> to do
>> smart rebalancing later on.
>>
>
> I think rebalancing aka shuffling will turn out to be one of those
> devil-in-the-details things. Because any document can emit any key, in
> the case of rebalancing, if you have to rebuild part of an index due
> to node-failure, you'd need to re-request from every other node, any
> view rows that might fit in that range. This requires every node to
> know about every other node.
>
> If view data is stored with document data, then nodes need only know
> about their child nodes, in the tree structure. Recovering from
> node-failure is easy: just swap in the failed node's hot-backup, and
> regenerate the views on it.
>
> I agree that the cost of merge sort will be ongoing, but I think the
> simplicity of this approach at least indicates that we should take it
> for the initial work. If we consider rebalancing an optimization, we
> can add it later.
>
> I think a better optimization would be to have inner nodes of the tree
> lazily cache the view rows of their children. This way the computation
> is spread out but the hops for popular queries can be mostly
> eliminated.

Ok, +1 from me.  I totally agree that rebalancing can get hairy, but  
hey, that's what makes this fun!

>>>  4) Should the consistent hashing algorithm map ids to leaf nodes  
>>> or just
>>>> to
>>>> children? I lean toward children because it encapsulates  
>>>> knowledge about
>>>> the
>>>> layout of subtrees at each tree level.
>>>>
>>>
>>> If the algorithm maps to children, does that mean every document  
>>> lookup has
>>> to traverse the tree?  I'm not sure that's a great idea.  Up to  
>>> ~100 nodes I
>>> think it may be better to have all document lookups take O(1)  
>>> hops.  I think
>>> distributed Erlang can keep a system of that size globally  
>>> connected without
>>> too much trouble.
>>
>
> I like the strict tree approach. I'd translate Adam's comment as:
> distributed Erlang can probably handle a tree of depth=1, even with
> ~100 nodes.

I'd like to hear more about how we implement redundancy and handle  
node failures in the tree structure.  In a pure consistent hashing  
ring, whether globally connected (Dynamo) or not (Chord), there are  
clear procedures for dealing with node failures, usually involving  
storing copies of the data at adjacent nodes along the ring.  Do we  
have an analogue of that in the tree?  I'm especially worried about  
what happens when inner nodes go down.

Best, Adam


Re: CouchDB Cluster/Partition GSoC

Posted by Chris Anderson <jc...@apache.org>.
On Tue, Mar 31, 2009 at 10:59 PM, Randall Leeds <ra...@gmail.com> wrote:
> On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <ko...@apache.org> wrote:
>
>> Hi Randall, cool!  I can chime in on a couple of the questions ...
>>
>
> Adam, thanks for your quick reply and thorough comments. The more people
> chime in on this discussion the better I can make the proposal, both in
> terms of likelihood for acceptance by a mentor/Google and the value of the
> resulting work for the community. I will aim to post a formalized draft of
> the proposal on my GSoC registration page sometime tomorrow and open it up
> for comments. Submission deadline is Friday.
>
>
>> On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:
>>
>>  1) What's required to make CouchDB a full OTP application? Isn't it using
>>> gen_server already?
>>>
>>
>> Yes, in fact CouchDB is already an OTP application using supervisors,
>> gen_servers, and gen_events.  There are situations in which it could do a
>> better job of adhering to OTP principles, and it could probably also use
>> some refactoring to make the partitioning code fit in easily.
>
>
> Cool. We'll see what we need here as we clarify other issues.
>
>
>>
>>  2) What about _all_docs and seq-num?
>>>
>>
>> I presume _all_docs gets merged like any other view.  _all_docs_by_seq is a
>> different story.  In the current code the sequence number is incremented by
>> one for every update.  If we want to preserve that behavior in partitioned
>> databases we need some sort of consensus algorithm or master server to
>> choose the next sequence number.  It could easily turn into a bottleneck or
>> single point-of-failure if we're not careful.
>>
>> The alternatives are to a) abandon the current format for update sequences
>> in favor of vector clocks or something more opaque, or b) have
>> _all_docs_by_seq be strictly a node-local query.  I'd prefer the former, as
>> I think it will be useful for e.g. external indexers to treat the
>> partitioned database just like a single server one.  If we do the latter, I
>> think it means that either the external indexers have to be installed on
>> every node, or at least they have to be aware of all the partitions.
>
>
> If at all possible I think we should have the entire partition group appear
> as a single server from the outside. One thing that comes to mind here is a
> question about sequence numbers. Vector clocks only guarantee a partial
> ordering, but I'm under the impression that currently seq numbers have a
> strict ordering.
>
> Database sequence numbers are used in replication and in determining whether
> views need refreshing. Anything else I'm missing? Currently it seems there
> is no tracking of which updates actually change a view index (comment on
> line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice
> win. See my answer to number (3).
>
> The easy way to manage seq numbers is to let one node be the write master
> for any cluster. (The root node of any partition group could actually be a
> cluster, but if writes always go through a master the master can maintain
> the global sequence number for the partition group).

The problem with this approach is that the main use-case for
partitioning is when your incoming writes exceed the capacity of a
single node. By partitioning the key-space, you can get more
write-throughput.

I'm not sure that an update-seq per node is such a bad thing, as it
will require any external indexers to be deployed in a 1-to-1
relationship to nodes, which automatically balances the load for the
indexer as well. With a merged seq-id, users would be encouraged to
partition CouchDB without bothering to partition indexers. Maybe this
is acceptable in some cases, but not in the general case.

>
>
>> One other thing that bothers me is the merge-sort required for every view
>> lookup.  In *really* large clusters it won't be good if queries for a single
>> key in a view have to hit each partition.  We could have an alternative
>> structure where each view gets partitioned much like the document data while
>> it's built.  I worry that a view partitioned in this way may need frequent
>> rebalancing during the build, since view keys are probably not going to be
>> uniformly distributed.  Nevertheless, I think the benefit of having many
>> view queries only hit a small subset of nodes in the cluster is pretty huge.
>
>
> I agree that the merge-sort is something we need to look at carefully. We
> should never hit a node in a view query unless it has data we need. We
> certainly can't avoid merging altogether, but we can make an effort to do
> smart rebalancing later on.
>

I think rebalancing aka shuffling will turn out to be one of those
devil-in-the-details things. Because any document can emit any key, in
the case of rebalancing, if you have to rebuild part of an index due
to node-failure, you'd need to re-request from every other node, any
view rows that might fit in that range. This requires every node to
know about every other node.

If view data is stored with document data, then nodes need only know
about their child nodes, in the tree structure. Recovering from
node-failure is easy: just swap in the failed node's hot-backup, and
regenerate the views on it.

I agree that the cost of merge sort will be ongoing, but I think the
simplicity of this approach at least indicates that we should take it
for the initial work. If we consider rebalancing an optimization, we
can add it later.

I think a better optimization would be to have inner nodes of the tree
lazily cache the view rows of their children. This way the computation
is spread out but the hops for popular queries can be mostly
eliminated.
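
A per-inner-node ets table would probably do for a first cut; a sketch
(made-up names, no expiry or invalidation):

-module(row_cache_sketch).
-export([new/0, rows/3]).

new() -> ets:new(child_view_rows, [set, public]).

%% CacheKey would be something like {ChildNode, ViewSig, QueryParams};
%% FetchFun asks the child for its sorted rows on a miss.
rows(Cache, CacheKey, FetchFun) ->
    case ets:lookup(Cache, CacheKey) of
        [{_, Rows}] ->
            Rows;
        [] ->
            Rows = FetchFun(),
            ets:insert(Cache, {CacheKey, Rows}),
            Rows
    end.

Invalidation is the hard part, of course -- an entry is only good until
the child's update seq moves.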

>
>>
>>  3) Can we agree on a proposed solution to the layout of partition nodes? I
>>> like the tree solution, as long as it is extremely flexible wrt tree depth.
>>>
>>
>> I'm not sure we're ready to do that.  In fact, I think we may need to
>> implement a couple of different topologies and profile them to see what
>> works best.  The tree topology is an interesting idea, but it may
>>
>
> That was a silly question. I didn't expect these questions to be easy. That
> should have read as a discussion prompt rather than a call for consensus.
>
> We should probably clarify the impetus for a tree structure. Computationally
> intensive reduces are the primary use case and the tree is a good way to get
> speedup here. In the case of a map-only view, we still need to merge and
> aggregate the results from each shard. This merge needs to happen somewhere,
> likely either at the node that's servicing the request or recursively up the
> tree. In either case, we agree that there's not much win if every view
> request has to hit every node. Therefore, I think we may need to start
> tracking which updates affect the view index.
>
>
>>
>>  4) Should the consistent hashing algorithm map ids to leaf nodes or just
>>> to
>>> children? I lean toward children because it encapsulates knowledge about
>>> the
>>> layout of subtrees at each tree level.
>>>
>>
>> If the algorithm maps to children, does that mean every document lookup has
>> to traverse the tree?  I'm not sure that's a great idea.  Up to ~100 nodes I
>> think it may be better to have all document lookups take O(1) hops.  I think
>> distributed Erlang can keep a system of that size globally connected without
>> too much trouble.
>

I like the strict tree approach. I'd translate Adam's comment as:
distributed Erlang can probably handle a tree of depth=1, even with
~100 nodes.

>
> Yes, I think you're right. We should definitely optimize for O(1) reads for
> document lookups and hash directly to leaf nodes.
>
> So, we need a consistent hash implementation. I will include this in the
> proposal.
> From there, where should we go?
>
> Thanks in advance,
> Randall
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Re: CouchDB Cluster/Partition GSoC

Posted by Adam Kocoloski <ad...@gmail.com>.
On Apr 1, 2009, at 1:59 AM, Randall Leeds wrote:

> On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <ko...@apache.org>  
> wrote:
>
>> Hi Randall, cool!  I can chime in on a couple of the questions ...
>>
>
> Adam, thanks for your quick reply and thorough comments. The more  
> people
> chime in on this discussion the better I can make the proposal, both  
> in
> terms of likelihood for acceptance by a mentor/Google and the value  
> of the
> resulting work for the community. I will aim to post a formalized  
> draft of
> the proposal on my GSoC registration page sometime tomorrow and open  
> it up
> for comments. Submission deadline is Friday.

Sounds like a plan to me.

>> 2) What about _all_docs and seq-num?
>>
>> I presume _all_docs gets merged like any other view.   
>> _all_docs_by_seq is a
>> different story.  In the current code the sequence number is  
>> incremented by
>> one for every update.  If we want to preserve that behavior in  
>> partitioned
>> databases we need some sort of consensus algorithm or master server  
>> to
>> choose the next sequence number.  It could easily turn into a  
>> bottleneck or
>> single point-of-failure if we're not careful.
>>
>> The alternatives are to a) abandon the current format for update  
>> sequences
>> in favor of vector clocks or something more opaque, or b) have
>> _all_docs_by_seq be strictly a node-local query.  I'd prefer the  
>> former, as
>> I think it will be useful for e.g. external indexers to treat the
>> partitioned database just like a single server one.  If we do the  
>> latter, I
>> think it means that either the external indexers have to be  
>> installed on
>> every node, or at least they have to be aware of all the partitions.
>
> If at all possible I think we should have the entire partition group  
> appear
> as a single server from the outside. One thing that comes to mind  
> here is a
> question about sequence numbers. Vector clocks only guarantee a  
> partial
> ordering, but I'm under the impression that currently seq numbers  
> have a
> strict ordering.

Yes, that's a good point.  Vector clocks may not be sufficient here.   
On the other hand, do we absolutely need a strict ordering of events?   
If the purpose of these sequence numbers is to ensure that replicators  
and indexers don't miss any updates, can't we just interpret GET  
_all_docs_by_seq as "give me all the updates that *might* have  
happened after X"?  That's a request we can answer with vector clocks,  
it's just the set of all updates such that VC(X') >= VC(X).  Of  
course, it's less efficient in that we may send duplicate updates in a  
write-heavy scenario.
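
For concreteness, the comparison I have in mind looks roughly like this,
with clocks represented as {Node, LocalSeq} orddicts (the representation
is just a strawman):

-module(vclock_sketch).
-export([increment/2, descends/2]).

%% A clock is an orddict of NodeName -> that node's local update seq.
increment(Node, Clock) ->
    orddict:update_counter(Node, 1, Clock).

%% descends(A, B): true when every counter in B is matched or exceeded
%% in A, i.e. VC(A) >= VC(B). The _all_docs_by_seq answer above is then
%% the set of updates X' with descends(VC(X'), VC(X)).
descends(A, B) ->
    lists:all(fun({Node, Seq}) ->
                      case orddict:find(Node, A) of
                          {ok, Seq2} -> Seq2 >= Seq;
                          error      -> false
                      end
              end, orddict:to_list(B)).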

Caveat: I haven't given much thought to how we'd efficiently store old  
versions of the vector clock at all nodes.

> Database sequence numbers are used in replication and in determining  
> whether
> views need refreshing. Anything else I'm missing?

Any external indexers (couchdb-lucene, for instance) also need the  
sequence numbers.

> Currently it seems there is no tracking of which updates actually  
> change a view index (comment on
> line 588 of couch_httpd_view.erl on trunk). Improving this would be  
> a nice
> win. See my answer to number (3).

That's only partially true.  You're right that the Etags aren't yet  
smart enough to know when a view stayed the same.  However, we  
definitely do track relationships between documents and view keys in a  
separate btree -- we have to if we want to expire the old view rows  
when a document is updated.  I think we should eventually be able to  
leverage this information to be smarter about view Etags.

> The easy way to manage seq numbers is to let one node be the write  
> master
> for any cluster. (The root node of any partition group could  
> actually be a
> cluster, but if writes always go through a master the master can  
> maintain
> the global sequence number for the partition group).

Yeah, this scares me a little bit.  I assume by a write master you  
mean a node that's responsible for ordering all of the updates to a  
database, regardless of where those documents are actually stored on  
disk.  I'm sure we can build a performant implementation (it's just a  
counter, after all), but I worry about the availability of such a  
system.  I guess that's what supervision trees are for ... but I'd  
prefer to try to solve these problems in a decentralized manner if  
possible.  My $.02.
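
(To be clear, the counter itself really is trivial -- something like the
sketch below, module name made up -- it's only the availability story
that worries me.)

-module(seq_master).
-behaviour(gen_server).
-export([start_link/0, next/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).
next() -> gen_server:call(?MODULE, next).

init(Seq) -> {ok, Seq}.
handle_call(next, _From, Seq) -> {reply, Seq + 1, Seq + 1}.
handle_cast(_Msg, Seq) -> {noreply, Seq}.
handle_info(_Msg, Seq) -> {noreply, Seq}.
terminate(_Reason, _Seq) -> ok.
code_change(_OldVsn, Seq, _Extra) -> {ok, Seq}.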

>
>>
>> 3) Can we agree on a proposed solution to the layout of partition  
>> nodes? I
>>> like the tree solution, as long as it is extremely flexible wrt  
>>> tree depth.
>>>
>>
>> I'm not sure we're ready to do that.  In fact, I think we may need to
>> implement a couple of different topologies and profile them to see  
>> what
>> works best.  The tree topology is an interesting idea, but it may
>>
>
> That was a silly question. I didn't expect these questions to be  
> easy. That
> should have read as a discussion prompt rather than a call for  
> consensus.
>
> We should probably clarify the impetus for a tree structure.  
> Computationally
> intensive reduces are the primary use case and the tree is a good way  
> to get
> speedup here. In the case of a map-only view, we still need to merge  
> and
> aggregate the results from each shard. This merge needs to happen  
> somewhere,
> likely either at the node that's servicing the request or  
> recursively up the
> tree. In either case, we agree that there's not much win if every view
> request has to hit every node. Therefore, I think we may need to start
> tracking which updates affect the view index.

Good point -- caching the map-only views from leaf nodes could be a  
nice win for the tree structure. It hadn't clicked for me until just  
now.  Best,

Adam

> So, we need a consistent hash implementation. I will include this in  
> the
> proposal.
> From there, where should we go?
>
> Thanks in advance,
> Randall


Re: CouchDB Cluster/Partition GSoC

Posted by Randall Leeds <ra...@gmail.com>.
On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <ko...@apache.org> wrote:

> Hi Randall, cool!  I can chime in on a couple of the questions ...
>

Adam, thanks for your quick reply and thorough comments. The more people
chime in on this discussion the better I can make the proposal, both in
terms of likelihood for acceptance by a mentor/Google and the value of the
resulting work for the community. I will aim to post a formalized draft of
the proposal on my GSoC registration page sometime tomorrow and open it up
for comments. Submission deadline is Friday.


> On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:
>
>  1) What's required to make CouchDB a full OTP application? Isn't it using
>> gen_server already?
>>
>
> Yes, in fact CouchDB is already an OTP application using supervisors,
> gen_servers, and gen_events.  There are situations in which it could do a
> better job of adhering to OTP principles, and it could probably also use
> some refactoring to make the partitioning code fit in easily.


Cool. We'll see what we need here as we clarify other issues.


>
>  2) What about _all_docs and seq-num?
>>
>
> I presume _all_docs gets merged like any other view.  _all_docs_by_seq is a
> different story.  In the current code the sequence number is incremented by
> one for every update.  If we want to preserve that behavior in partitioned
> databases we need some sort of consensus algorithm or master server to
> choose the next sequence number.  It could easily turn into a bottleneck or
> single point-of-failure if we're not careful.
>
> The alternatives are to a) abandon the current format for update sequences
> in favor of vector clocks or something more opaque, or b) have
> _all_docs_by_seq be strictly a node-local query.  I'd prefer the former, as
> I think it will be useful for e.g. external indexers to treat the
> partitioned database just like a single server one.  If we do the latter, I
> think it means that either the external indexers have to be installed on
> every node, or at least they have to be aware of all the partitions.


If at all possible I think we should have the entire partition group appear
as a single server from the outside. One thing that comes to mind here is a
question about sequence numbers. Vector clocks only guarantee a partial
ordering, but I'm under the impression that currently seq numbers have a
strict ordering.

Database sequence numbers are used in replication and in determining whether
views need refreshing. Anything else I'm missing? Currently it seems there
is no tracking of which updates actually change a view index (comment on
line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice
win. See my answer to number (3).

The easy way to manage seq numbers is to let one node be the write master
for any cluster. (The root node of any partition group could actually be a
cluster, but if writes always go through a master the master can maintain
the global sequence number for the partition group).


> One other thing that bothers me is the merge-sort required for every view
> lookup.  In *really* large clusters it won't be good if queries for a single
> key in a view have to hit each partition.  We could have an alternative
> structure where each view gets partitioned much like the document data while
> it's built.  I worry that a view partitioned in this way may need frequent
> rebalancing during the build, since view keys are probably not going to be
> uniformly distributed.  Nevertheless, I think the benefit of having many
> view queries only hit a small subset of nodes in the cluster is pretty huge.


I agree that the merge-sort is something we need to look at carefully. We
should never hit a node in a view query unless it has data we need. We
certainly can't avoid merging altogether, but we can make an effort to do
smart rebalancing later on.


>
>  3) Can we agree on a proposed solution to the layout of partition nodes? I
>> like the tree solution, as long as it is extremely flexible wrt tree depth.
>>
>
> I'm not sure we're ready to do that.  In fact, I think we may need to
> implement a couple of different topologies and profile them to see what
> works best.  The tree topology is an interesting idea, but it may
>

That was a silly question. I didn't expect these questions to be easy. That
should have read as a discussion prompt rather than a call for consensus.

We should probably clarify the impetus for a tree structure. Computationally
intensive reduces are the primary use case and the tree is a good way to get
speedup here. In the case of a map-only view, we still need to merge and
aggregate the results from each shard. This merge needs to happen somewhere,
likely either at the node that's servicing the request or recursively up the
tree. In either case, we agree that there's not much win if every view
request has to hit every node. Therefore, I think we may need to start
tracking which updates affect the view index.


>
>  4) Should the consistent hashing algorithm map ids to leaf nodes or just
>> to
>> children? I lean toward children because it encapsulates knowledge about
>> the
>> layout of subtrees at each tree level.
>>
>
> If the algorithm maps to children, does that mean every document lookup has
> to traverse the tree?  I'm not sure that's a great idea.  Up to ~100 nodes I
> think it may be better to have all document lookups take O(1) hops.  I think
> distributed Erlang can keep a system of that size globally connected without
> too much trouble.


Yes, I think you're right. We should definitely optimize for O(1) reads for
document lookups and hash directly to leaf nodes.

So, we need a consistent hash implementation. I will include this in the
proposal.
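
To make the proposal concrete, here's the rough shape of the hashing piece
I have in mind (module name, hash function, and the absence of virtual
nodes are all placeholders):

-module(chash_sketch).
-export([ring/1, node_for/2]).

%% Hash each leaf node onto integer points on a ring.
ring(Nodes) ->
    lists:sort([{erlang:phash2(Node), Node} || Node <- Nodes]).

%% Map a doc id to the first node at or after its hash point, wrapping
%% around to the start of the ring. Document lookups stay O(1) hops;
%% the local scan is O(#nodes) in this naive version.
node_for(DocId, [{_, First} | _] = Ring) ->
    H = erlang:phash2(DocId),
    case [Node || {Point, Node} <- Ring, Point >= H] of
        [Node | _] -> Node;
        []         -> First
    end.

e.g. node_for(<<"some_doc_id">>, ring(['couch1@host1', 'couch2@host2'])).
A real version would want many points per physical node (virtual nodes)
so keys rebalance evenly when a node joins or leaves.
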
From there, where should we go?

Thanks in advance,
Randall

Re: CouchDB Cluster/Partition GSoC

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Randall, cool!  I can chime in on a couple of the questions ...

On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:

> 1) What's required to make CouchDB a full OTP application? Isn't it  
> using gen_server already?

Yes, in fact CouchDB is already an OTP application using supervisors,  
gen_servers, and gen_events.  There are situations in which it could  
do a better job of adhering to OTP principles, and it could probably  
also use some refactoring to make the partitioning code fit in easily.

> 2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view.  _all_docs_by_seq  
is a different story.  In the current code the sequence number is  
incremented by one for every update.  If we want to preserve that  
behavior in partitioned databases we need some sort of consensus  
algorithm or master server to choose the next sequence number.  It  
could easily turn into a bottleneck or single point-of-failure if  
we're not careful.

The alternatives are to a) abandon the current format for update  
sequences in favor of vector clocks or something more opaque, or b)  
have _all_docs_by_seq be strictly a node-local query.  I'd prefer the  
former, as I think it will be useful for e.g. external indexers to  
treat the partitioned database just like a single server one.  If we  
do the latter, I think it means that either the external indexers have  
to be installed on every node, or at least they have to be aware of  
all the partitions.

One other thing that bothers me is the merge-sort required for every  
view lookup.  In *really* large clusters it won't be good if queries  
for a single key in a view have to hit each partition.  We could have  
an alternative structure where each view gets partitioned much like  
the document data while it's built.  I worry that a view partitioned in  
this way may need frequent rebalancing during the build, since view  
keys are probably not going to be uniformly distributed.   
Nevertheless, I think the benefit of having many view queries only hit  
a small subset of nodes in the cluster is pretty huge.
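
For reference, the merge itself is simple once each partition returns rows
already sorted by key; the cost is that it runs per query and touches
every partition. A sketch, with raw term order standing in for proper view
collation:

-module(view_merge_sketch).
-export([merge/1]).

%% Each element of RowLists is one partition's rows, already sorted by
%% key. Rows are {Key, DocId, Value} tuples; =< on the tuple stands in
%% for CouchDB's view collation (for plain term order lists:merge/1
%% would do the whole job).
merge(RowLists) ->
    lists:foldl(fun merge2/2, [], RowLists).

merge2(Rows, []) -> Rows;
merge2([], Acc) -> Acc;
merge2([R1 | Rest1] = L1, [R2 | Rest2] = L2) ->
    case R1 =< R2 of
        true  -> [R1 | merge2(Rest1, L2)];
        false -> [R2 | merge2(L1, Rest2)]
    end.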

> 3) Can we agree on a proposed solution to the layout of partition  
> nodes? I like the tree solution, as long as it is extremely flexible  
> wrt tree depth.

I'm not sure we're ready to do that.  In fact, I think we may need to  
implement a couple of different topologies and profile them to see  
what works best.  The tree topology is an interesting idea, but it may  
turn out that passing view results up the tree is slower than just  
sending them directly to the final destination and having that server  
do the rest of the work.  Off the cuff I think trees may be a great  
choice for computationally intensive reduce functions, but other views  
where the size of the data is large relative to the computational  
requirements may be better off minimizing the number of copies of the  
data that need to be made.

> 4) Should the consistent hashing algorithm map ids to leaf nodes or  
> just to
> children? I lean toward children because it encapsulates knowledge  
> about the
> layout of subtrees at each tree level.

If the algorithm maps to children, does that mean every document  
lookup has to traverse the tree?  I'm not sure that's a great idea.   
Up to ~100 nodes I think it may be better to have all document lookups  
take O(1) hops.  I think distributed Erlang can keep a system of that  
size globally connected without too much trouble.

Cheers, Adam