Posted to dev@jackrabbit.apache.org by Marcel Reutegger <mr...@adobe.com> on 2012/03/01 12:05:46 UTC

[jr3] clustering

Hi,

recently I was thinking about an approach to implement
clustering in JR3 and I'd like to know what you think about it.

Distributed B-tree

the clustering ideas we discussed so far [0] partition the tree at
given paths, similar to mount points in a file system. I think
the drawback of those approaches is that they are rather static
and you'd probably have to define where those mount points
are. to me it seems rather difficult to set up a system that is
able to grow when the repository size increases.

so, I was thinking of something similar to what is described in this
paper [1] or in [2]. since a B-tree is basically an ordered
list of items, we'd have to linearize the JCR or MK hierarchy. I'm
not sure whether a depth-first or breadth-first traversal is
better suited. maybe there even exists a more sophisticated
space-filling curve, which is a combination of both. linearizing
the hierarchy on a B-tree should give us some locality for
nodes that are hierarchically close, and the probability is high that
they are requested in succession. the algorithm discussed in
[1] then distributes the nodes of the B-tree randomly to
machines (at least in the simplest case). note that a
B-tree node may contain many JCR/MK nodes. with a reasonable
size for a B-tree node, we'd get a good balance of locality
and distribution across multiple servers.
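
to make the linearization idea concrete, here is a minimal sketch (not
Jackrabbit code; the key scheme and class names are made up for
illustration). prefixing a zero-padded depth to the path and sorting
yields a breadth-first order, so all children of a node are adjacent in
the B-tree and listing them becomes a contiguous range scan:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: breadth-first linearization keys for a hierarchy.
public class BfsKeys {

    /** Builds the sort key for a path, e.g. "/a/b" -> "0002:/a/b". */
    static String key(String path) {
        int depth = path.equals("/") ? 0 : path.split("/").length - 1;
        return String.format("%04d:%s", depth, path);
    }

    /** Returns the sub-map holding exactly the children of the given path. */
    static SortedMap<String, String> children(TreeMap<String, String> index, String path) {
        boolean root = path.equals("/");
        int childDepth = (root ? 0 : path.split("/").length - 1) + 1;
        String from = String.format("%04d:%s", childDepth, root ? "/" : path + "/");
        String to = root
                ? String.format("%04d:", childDepth + 1)             // every depth-1 node is a child of the root
                : String.format("%04d:%s0", childDepth, path + "/"); // '0' sorts right after '/'
        return index.subMap(from, to);
    }
}
```

without the depth prefix, plain path keys would give a depth-first
order instead, which is exactly the trade-off between the two
traversals mentioned above.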

the algorithm uses optimistic concurrency control and
two-phase commits when changes need to be applied on
multiple servers. I'm not a big fan of two-phase commit, but
if we assume that most commits are rather small, changes will
be applied to a single server and a one-phase commit is
sufficient. two-phase commits would only be necessary for,
e.g., larger imports.
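
a rough sketch of that routing decision (the Server interface and
placement map are hypothetical, purely for illustration): group the
dirty B-tree nodes by owning server, commit directly when only one
server is involved, and fall back to two-phase commit otherwise:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: choose one-phase or two-phase commit per change set.
public class CommitRouter {

    interface Server {
        boolean prepare(List<String> nodeIds); // vote: can these changes be applied?
        void commit(List<String> nodeIds);
        void abort(List<String> nodeIds);
    }

    /** @param placement maps each dirty B-tree node id to its owning server */
    static boolean commit(Map<String, Server> placement) {
        // group the dirty node ids by owning server
        Map<Server, List<String>> byServer = new LinkedHashMap<>();
        placement.forEach((id, s) ->
                byServer.computeIfAbsent(s, k -> new ArrayList<>()).add(id));

        if (byServer.size() == 1) {
            // common case: a small commit touches one server, one phase suffices
            Map.Entry<Server, List<String>> e = byServer.entrySet().iterator().next();
            e.getKey().commit(e.getValue());
            return true;
        }
        // large commit (e.g. an import): classic two-phase commit
        for (Map.Entry<Server, List<String>> e : byServer.entrySet()) {
            if (!e.getKey().prepare(e.getValue())) {
                byServer.forEach(Server::abort);
                return false;
            }
        }
        byServer.forEach(Server::commit);
        return true;
    }
}
```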

Open questions:

how does MVCC fit into this? multiple revisions of the same
JCR/MK node could be stored on a B-tree node. whenever
an update happens, garbage collection could kick in and
purge outdated revisions. how to provide a consistent journal across
all servers is not clear to me right now.

How does backup work? this is quite tricky because it is
difficult to get a consistent snapshot of the distributed
tree. 

Regards
 Marcel


[0] https://docs.google.com/presentation/pub?id=131sVx5s58jAKE2FSVBfUZVQSl1W820_syyzLYRHGH6E&start=false&loop=false&delayms=3000#slide=id.g4272a65_0_39
[1] http://www.hpl.hp.com/techreports/2007/HPL-2007-193.pdf
[2] http://research.microsoft.com/en-us/people/aguilera/distributed-btree-vldb2008.pdf


RE: [jr3] clustering

Posted by Marcel Reutegger <mr...@adobe.com>.
> so, I was thinking of something similar as described in this
> paper [1] or similar [2]. since a B-tree is basically an ordered
> list of items we'd have to linearize the JCR or MK hierarchy. I'm
> not sure whether a depth or breadth first traversal is
> better suited. maybe there even exists a more sophisticated
> space filling curve, which is a combination of both.

I was thinking about this a bit more and came to the conclusion
that a breadth-first linearization works better for the usage
imposed by the JCR API (direct lookup and iterating through
child nodes).

Regards
 Marcel

RE: [jr3] clustering

Posted by Marcel Reutegger <mr...@adobe.com>.
> >http://en.wikipedia.org/wiki/Incremental_encoding
> 
> OK, but I don't know how it could be used as the node id. Where would you
> keep the lookup table?

lookup is done using the previously proposed distributed B+Tree, where
the path of a node is used as the key.

Regards
 Marcel

Re: [jr3] clustering

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>http://en.wikipedia.org/wiki/Incremental_encoding

OK, but I don't know how it could be used as the node id. Where would you
keep the lookup table?

>>MongoDB supports two-phase commit, but I would only use it for special
>> cases like locking.
>
>do you mean this: 
>http://cookbook.mongodb.org/patterns/perform-two-phase-commits/ ?

Yes... now I see it's not "really" two-phase-commit.

Regards,
Thomas


RE: [jr3] clustering

Posted by Marcel Reutegger <mr...@adobe.com>.
> >>The node id
> >> could be the complete path (which isn't terribly efficient, but avoids
> >> problems).
> >
> >right. but with delta encoding the space utilization will be better.
> >given the locality of JCR/MK nodes in a low level block, you will
> >have very similar paths that share a long prefix.
> 
> Sorry I don't know what you mean with delta encoding, could you give an
> example?

OK, "delta encoding" seems to be too generic a term. what I mean is incremental encoding: http://en.wikipedia.org/wiki/Incremental_encoding
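
a small sketch of incremental (front) encoding as it could apply to the
sorted paths in a B-tree block (class and method names are made up):
each entry stores only the length of the prefix shared with the previous
path plus the remaining suffix, so hierarchically close paths cost
little extra space:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: front coding of sorted paths.
public class FrontCoding {

    /** Encodes sorted paths as "sharedPrefixLength|suffix" entries. */
    static List<String> encode(List<String> sortedPaths) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String path : sortedPaths) {
            int shared = 0;
            int max = Math.min(prev.length(), path.length());
            while (shared < max && prev.charAt(shared) == path.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + path.substring(shared));
            prev = path;
        }
        return out;
    }

    /** Decodes by re-expanding each entry against the previous path. */
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String entry : encoded) {
            int sep = entry.indexOf('|');
            int shared = Integer.parseInt(entry.substring(0, sep));
            prev = prev.substring(0, shared) + entry.substring(sep + 1);
            out.add(prev);
        }
        return out;
    }
}
```

e.g. the paths /content/a, /content/a/b and /content/c encode to
0|/content/a, 10|/b and 9|c, which shows how long shared prefixes
collapse.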

> >>This sounds like a promising approach for MongoDB.
> >
> >MongoDB only gives you atomic guarantees on the document level.
> >how would you implement a commit for multiple JCR nodes?
> 
> Do we need to guarantee all commits to be atomic (all or nothing) in a
> cluster?
> 
> MongoDB supports two-phase commit, but I would only use it for special
> cases like locking.

do you mean this: http://cookbook.mongodb.org/patterns/perform-two-phase-commits/ ?

regards
 marcel

Re: [jr3] clustering

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>>The node id
>> could be the complete path (which isn't terribly efficient, but avoids
>> problems). 
>
>right. but with delta encoding the space utilization will be better.
>given the locality of JCR/MK nodes in a low level block, you will
>have very similar paths that share a long prefix.

Sorry I don't know what you mean with delta encoding, could you give an
example?

>>This sounds like a promising approach for MongoDB.
>
>MongoDB only gives you atomic guarantees on the document level.
>how would you implement a commit for multiple JCR nodes?

Do we need to guarantee all commits to be atomic (all or nothing) in a
cluster? 

MongoDB supports two-phase commit, but I would only use it for special
cases like locking.

Regards,
Thomas


RE: [jr3] clustering

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

> The node id
> could be the complete path (which isn't terribly efficient, but avoids
> problems). 

right. but with delta encoding the space utilization will be better.
given the locality of JCR/MK nodes in a low-level block, you will
have very similar paths that share a long prefix.

> This sounds like a promising approach for MongoDB.

MongoDB only gives you atomic guarantees on the document level.
how would you implement a commit for multiple JCR nodes?

Regards
 Marcel

RE: [jr3] clustering

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

> Hmm I see. I came up with a similar approach loooong time ago. Even
> before the Microkernel. Anyway, I think the Microkernel API does not
> mandate root node versions corresponding to revisions. In fact I think
> the approach you are proposing will scale better wrt. write contention
> on the root node since there is no need for writing a new root node on
> every write operation. However, getting a consistent journal across
> cluster nodes seems more difficult here as you said.

what about consistency requirements? does the journal have to be
exactly in sync with committed changes? that is, does an MK implementation
have to guarantee that a commit immediately becomes available
in the journal-related methods?

> >>> How does backup work? this is quite tricky because it is
> >>> difficult to get a consistent snapshot of the distributed
> >>> tree.
> >>
> >> MVCC should make that easy: just make a backup of the head revision at
> >> that time.
> >
> > hmm, I'm not sure that will scale. consider a large repository
> > where traversing all nodes takes a long time.
> >
> > I think backup should be supported at a lower level to be
> > efficient.
> 
> Hmm right, that makes sense.

I was reading up on backup and how it is done in sinfonia. it
basically uses a transactional stop-the-world (or rather pause-the-world)
approach where all pending changes are saved to the disk
image and all further changes are written to the forward log
only, until the backup is finished.
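
as an illustration only (this is not sinfonia's actual code; the store
and its methods are invented for the sketch), the pause-the-world idea
boils down to freezing the disk image during the backup and diverting
writes to a forward log that is replayed afterwards:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: while a backup runs, the "disk image" stays frozen and can be
// copied as a consistent snapshot; committed changes go to a forward log
// that is replayed into the image once the backup finishes.
public class BackupStore {

    private final Map<String, String> image = new LinkedHashMap<>(); // disk image
    private final List<String[]> forwardLog = new ArrayList<>();
    private boolean backupRunning = false;

    synchronized void put(String key, String value) {
        if (backupRunning) {
            forwardLog.add(new String[] { key, value }); // image stays frozen
        } else {
            image.put(key, value);
        }
    }

    /** Freezes the image; the returned copy is the consistent snapshot. */
    synchronized Map<String, String> startBackup() {
        backupRunning = true;
        return new LinkedHashMap<>(image);
    }

    /** Replays the forward log into the image and resumes normal writes. */
    synchronized void finishBackup() {
        for (String[] e : forwardLog) {
            image.put(e[0], e[1]);
        }
        forwardLog.clear();
        backupRunning = false;
    }

    synchronized String get(String key) {
        // reads check the log first so writers observe their own changes
        for (int i = forwardLog.size() - 1; i >= 0; i--) {
            if (forwardLog.get(i)[0].equals(key)) {
                return forwardLog.get(i)[1];
            }
        }
        return image.get(key);
    }
}
```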

regards
 marcel

Re: [jr3] clustering

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>>I don't see a need why the parent node needs to be updated
>> when a child node is added, removed or updated.
>
>However, getting a consistent journal across
>cluster nodes seems more difficult here as you said.

One option might be to persist the journal within or near the node data
itself. I guess move and re-order operations could be modeled as
delete+add (only on the storage level; not in the journal). The node id
could be the complete path (which isn't terribly efficient, but avoids
problems). This sounds like a promising approach for MongoDB.

Regards,
Thomas


Re: [jr3] clustering

Posted by Michael Dürig <md...@apache.org>.
>>> how does MVCC fit into this? multiple revisions of the same
>>> JCR/MK node could be stored on a B-tree node. whenever
>>> an update happens the garbage collection could kick in and
>>> purge outdated revisions. providing a consistent journal across
>>> all servers is not clear to me right now.
>>
>> I think MVCC is not a problem as such. To the contrary, since it is
>> append only it should even be less problematic. IMO garbage collection
>> is an entirely different story and we shouldn't worry too much about it
>> until we have a good working model for clustering itself.
>>
>> Wrt. the journal: isn't that just the list of versions of the root node?
>> This should be for free then. But I think I'm missing something here...
>
> the model I have in mind doesn't have root node versions that
> correspond to MK revisions. Is this mandated somehow by the MK
> API design?
>
> in my model only the nodes that changed get new revisions.
> and reading from the tree with a given revision means it
> will pick the revision which is less or equal to the given revision.
>
> e.g. if you have a node /a/b/c which was changed three times
> in revisions 2, 7, and 12, and a client reads at revision 9, the
> implementation will return revision 7.
>
> I don't see a need why the parent node needs to be updated
> when a child node is added, removed or updated.

Hmm I see. I came up with a similar approach a loooong time ago. Even 
before the Microkernel. Anyway, I think the Microkernel API does not 
mandate root node versions corresponding to revisions. In fact I think 
the approach you are proposing will scale better wrt. write contention 
on the root node since there is no need for writing a new root node on 
every write operation. However, getting a consistent journal across 
cluster nodes seems more difficult here as you said.

>
>>> How does backup work? this is quite tricky because it is
>>> difficult to get a consistent snapshot of the distributed
>>> tree.
>>
>> MVCC should make that easy: just make a backup of the head revision at
>> that time.
>
> hmm, I'm not sure that will scale. consider a large repository
> where traversing all nodes takes a long time.
>
> I think backup should be supported at a lower level to be
> efficient.

Hmm right, that makes sense.

Michael

>
> e.g. something like proposed in [0] 4.9.
>
> regards
>   marcel
>
> [0] http://cs.ucla.edu/~kohler/class/08w-dsi/aguilera07sinfonia.pdf
>

RE: [jr3] clustering

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

> An open question though is how replication would fit into this picture.
> There is some mention in the paper about backup nodes for fail-over. Not
> sure if that is what we are aiming for or whether we want to go beyond
> that.

I think what they use is a backup server, which is kept in sync and can act
as a failover. the strict synchronization does look a bit troublesome. maybe
this could be relaxed when a commit only contains changes for a single
server.

> The paper assumes network connections to be reliable (i.e. no messages
> altered, dropped or duplicated). However there is no mention of how the
> system would recover from a partitioned network. That is, how it would
> recover when some links go down and come up later. However, since it
> uses 2 phase commit, I think it would basically inherit that behaviour
> which means cluster nodes could become blocked (See [1] proposition 7.1
> and 7.2).

writes never block in that system; they would simply fail until the pending
transaction is either committed or aborted. IIUC reads will never block.

> OTOH the combination of optimistic locking during the transaction itself
> and pessimistic locking only for the commit itself will probably result
> in very good write throughput. Even more so since probably in many cases
> there is only a single node involved in the transaction such that a
> simple commit suffices.
> 
> More comments see inline below.
> 
> [1] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
> 
> On 1.3.12 11:05, Marcel Reutegger wrote:
> 
> [...]
> >
> > so, I was thinking of something similar as described in this
> > paper [1] or similar [2]. since a B-tree is basically an ordered
> > list of items we'd have to linearize the JCR or MK hierarchy. I'm
> > not sure whether a depth or breadth first traversal is
> > better suited. maybe there even exists a more sophisticated
> > space filling curve, which is a combination of both. linearizing
> > the hierarchy on a B-tree should give us some locality for
> > nodes that are hierarchically close, and the probability is high that
> > they are requested in succession.
> 
> Node types may give hints here. As long as they are not recursive (i.e.
> nt:hierarchy) node types usually define "things that belong together".
> 
> [...]
> 
> > Open questions:
> >
> > how does MVCC fit into this? multiple revisions of the same
> > JCR/MK node could be stored on a B-tree node. whenever
> > an update happens the garbage collection could kick in and
> > purge outdated revisions. providing a consistent journal across
> > all servers is not clear to me right now.
> 
> I think MVCC is not a problem as such. To the contrary, since it is
> append only it should even be less problematic. IMO garbage collection
> is an entirely different story and we shouldn't worry too much about it
> until we have a good working model for clustering itself.
> 
> Wrt. the journal: isn't that just the list of versions of the root node?
> This should be for free then. But I think I'm missing something here...

the model I have in mind doesn't have root node versions that
correspond to MK revisions. Is this mandated somehow by the MK
API design?

in my model only the nodes that changed get new revisions,
and reading from the tree with a given revision means the
implementation will pick the newest revision that is less than or equal
to the given revision.

e.g. if you have a node /a/b/c which was changed three times,
in revisions 2, 7, and 12, and a client reads at revision 9, the
implementation will return the state from revision 7.
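
that read rule is a floor lookup over the per-node revision map. a
minimal sketch (class name invented for illustration) using an ordered
map per node:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: each node keeps its own revisions; a read at revision r
// returns the newest revision <= r (the floor entry).
public class NodeRevisions {

    private final TreeMap<Long, String> revisions = new TreeMap<>();

    void write(long revision, String state) {
        revisions.put(revision, state);
    }

    /** Returns the node state visible at the given revision, or null. */
    String read(long revision) {
        Map.Entry<Long, String> e = revisions.floorEntry(revision);
        return e == null ? null : e.getValue();
    }
}
```

with writes at revisions 2, 7 and 12, a read at revision 9 yields the
state written at revision 7, matching the /a/b/c example above.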

I don't see a need why the parent node needs to be updated
when a child node is added, removed or updated.

> > How does backup work? this is quite tricky because it is
> > difficult to get a consistent snapshot of the distributed
> > tree.
> 
> MVCC should make that easy: just make a backup of the head revision at
> that time.

hmm, I'm not sure that will scale. consider a large repository
where traversing all nodes takes a long time.

I think backup should be supported at a lower level to be
efficient.

e.g. something like proposed in [0] 4.9.

regards
 marcel

[0] http://cs.ucla.edu/~kohler/class/08w-dsi/aguilera07sinfonia.pdf


Re: [jr3] clustering

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 01.03.2012, at 14:50, Michael Dürig wrote:

> Nice to see new ideas wrt. clustering. I quickly skimmed over the HP
> paper and think that approach sounds promising. Having a flexible
> partitioning scheme which also supports dynamic repartitioning (i.e.
> additions of servers, migration of nodes) seems a better approach to me
> than fixed partitioning by path as we discussed earlier.

FWIW: I think quite often you can explicitly model the partitions on the
application level and gain advantages from that (like one machine for the
first 100k user home directories). Such modeling should be easily possible.

Cheers,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel


Re: [jr3] clustering

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

I didn't read all the details, but I think this is a promising approach.
Partitioning is flexible, so cluster nodes can be added and removed at
runtime without downtime. I believe we could use the 'virtual repository'
approach as the base for this implementation. That means we first build a
solution with fixed and manual partitioning, and once we have that we can
extend it towards flexible partitioning.

It seems failover and synchronizing changes (for content that is stored in
multiple cluster nodes) would need to be solved in another way, which also
fits well into the picture, and performance should be very good.

Regards,
Thomas

On 3/1/12 2:50 PM, "Michael Dürig" <md...@apache.org> wrote:

>Hi Marcel,
>
>Nice to see new ideas wrt. clustering. I quickly skimmed over the HP
>paper and think that approach sounds promising. Having a flexible
>partitioning scheme which also supports dynamic repartitioning (i.e.
>additions of servers, migration of nodes) seems a better approach to me
>than fixed partitioning by path as we discussed earlier.
>
>An open question though is how replication would fit into this picture.
>There is some mention in the paper about backup nodes for fail-over. Not
>sure if that is what we are aiming for or whether we want to go beyond
>that.
>
>The paper assumes network connections to be reliable (i.e. no messages
>altered, dropped or duplicated). However there is no mention of how the
>system would recover from a partitioned network. That is, how it would
>recover when some links go down and come up later. However, since it
>uses 2 phase commit, I think it would basically inherit that behaviour
>which means cluster nodes could become blocked (See [1] proposition 7.1
>and 7.2).
>
>OTOH the combination of optimistic locking during the transaction itself
>and pessimistic locking only for the commit itself will probably result
>in very good write throughput. Even more so since probably in many cases
>there is only a single node involved in the transaction such that a
>simple commit suffices.
>
>More comments see inline below.
>
>[1] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx
>
>On 1.3.12 11:05, Marcel Reutegger wrote:
>
>[...]
>>
>> so, I was thinking of something similar as described in this
>> paper [1] or similar [2]. since a B-tree is basically an ordered
>> list of items we'd have to linearize the JCR or MK hierarchy. I'm
>> not sure whether a depth or breadth first traversal is
>> better suited. maybe there even exists a more sophisticated
>> space filling curve, which is a combination of both. linearizing
>> the hierarchy on a B-tree should give us some locality for
>> nodes that are hierarchically close, and the probability is high that
>> they are requested in succession.
>
>Node types may give hints here. As long as they are not recursive (i.e.
>nt:hierarchy) node types usually define "things that belong together".
>
>[...]
>
>> Open questions:
>>
>> how does MVCC fit into this? multiple revisions of the same
>> JCR/MK node could be stored on a B-tree node. whenever
>> an update happens the garbage collection could kick in and
>> purge outdated revisions. providing a consistent journal across
>> all servers is not clear to me right now.
>
>I think MVCC is not a problem as such. To the contrary, since it is
>append only it should even be less problematic. IMO garbage collection
>is an entirely different story and we shouldn't worry too much about it
>until we have a good working model for clustering itself.
>
>Wrt. the journal: isn't that just the list of versions of the root node?
>This should be for free then. But I think I'm missing something here...
>
>>
>> How does backup work? this is quite tricky because it is
>> difficult to get a consistent snapshot of the distributed
>> tree.
>
>MVCC should make that easy: just make a backup of the head revision at
>that time.
>
>Michael
>
>>
>> Regards
>>   Marcel
>>
>>
>> [0] https://docs.google.com/presentation/pub?id=131sVx5s58jAKE2FSVBfUZVQSl1W820_syyzLYRHGH6E&start=false&loop=false&delayms=3000#slide=id.g4272a65_0_39
>> [1] http://www.hpl.hp.com/techreports/2007/HPL-2007-193.pdf
>> [2] http://research.microsoft.com/en-us/people/aguilera/distributed-btree-vldb2008.pdf
>>


Re: [jr3] clustering

Posted by Michael Dürig <md...@apache.org>.
Hi Marcel,

Nice to see new ideas wrt. clustering. I quickly skimmed over the HP
paper and think that approach sounds promising. Having a flexible
partitioning scheme which also supports dynamic repartitioning (i.e.
additions of servers, migration of nodes) seems a better approach to me
than fixed partitioning by path as we discussed earlier.

An open question though is how replication would fit into this picture. 
There is some mention in the paper about backup nodes for fail-over. Not 
sure if that is what we are aiming for or whether we want to go beyond 
that.

The paper assumes network connections to be reliable (i.e. no messages
altered, dropped or duplicated). However there is no mention of how the
system would recover from a partitioned network. That is, how it would
recover when some links go down and come up later. Since it
uses 2-phase commit, I think it would basically inherit that behaviour,
which means cluster nodes could become blocked (see [1] propositions 7.1
and 7.2).

OTOH the combination of optimistic locking during the transaction itself 
and pessimistic locking only for the commit itself will probably result 
in very good write throughput. Even more so since probably in many cases 
there is only a single node involved in the transaction such that a 
simple commit suffices.

More comments see inline below.

[1] http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx

On 1.3.12 11:05, Marcel Reutegger wrote:

[...]
>
> so, I was thinking of something similar as described in this
> paper [1] or similar [2]. since a B-tree is basically an ordered
> list of items we'd have to linearize the JCR or MK hierarchy. I'm
> not sure whether a depth or breadth first traversal is
> better suited. maybe there even exists a more sophisticated
> space filling curve, which is a combination of both. linearizing
> the hierarchy on a B-tree should give us some locality for
> nodes that are hierarchically close, and the probability is high that
> they are requested in succession.

Node types may give hints here. As long as they are not recursive (i.e. 
nt:hierarchy) node types usually define "things that belong together".

[...]

> Open questions:
>
> how does MVCC fit into this? multiple revisions of the same
> JCR/MK node could be stored on a B-tree node. whenever
> an update happens the garbage collection could kick in and
> purge outdated revisions. providing a consistent journal across
> all servers is not clear to me right now.

I think MVCC is not a problem as such. To the contrary, since it is 
append only it should even be less problematic. IMO garbage collection 
is an entirely different story and we shouldn't worry too much about it 
until we have a good working model for clustering itself.

Wrt. the journal: isn't that just the list of versions of the root node? 
This should be for free then. But I think I'm missing something here...

>
> How does backup work? this is quite tricky because it is
> difficult to get a consistent snapshot of the distributed
> tree.

MVCC should make that easy: just make a backup of the head revision at 
that time.

Michael

>
> Regards
>   Marcel
>
>
> [0] https://docs.google.com/presentation/pub?id=131sVx5s58jAKE2FSVBfUZVQSl1W820_syyzLYRHGH6E&start=false&loop=false&delayms=3000#slide=id.g4272a65_0_39
> [1] http://www.hpl.hp.com/techreports/2007/HPL-2007-193.pdf
> [2] http://research.microsoft.com/en-us/people/aguilera/distributed-btree-vldb2008.pdf
>