You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jackrabbit.apache.org by Michael Dürig <md...@apache.org> on 2012/02/03 19:44:30 UTC

[jr3 microkernel] Change log consolidation

Hi,

While working on a new transient space implementation [1] I discovered 
that recreating a list of changes from transient modifications to the 
hierarchy is like opening a can of worms. Rethinking things it turned 
out, that it is much simpler to directly consolidate the change log. 
This approach is much more flexible and versatile since knowledge of the 
persisted hierarchy is not necessary at all. Thus this approach should 
work equally well with change logs for jr3, the Microkernel, Jackrabbit 
core, JCR and SPI.

The algorithm is capable of creating minimal change logs from any valid 
change log (that is from any change log that can be applied to some 
hierarchy).

I extensively tested the algorithm on randomly created change logs [3]. 
This resulted in some interesting numbers: consolidation results in an 
decrease of about 45% of these randomly generated change logs. 
Furthermore the resulting size of the H2 database files *decreased up to 
200%*!

The implementation [2] is currently part of the jackrabbit-microkernel 
module in the sandbox. However it does not have any crucial dependencies 
so it should be easy to adapt for other usages.

The gory details of the algorithm are based on some algebraic properties 
of move operations on paths as detailed in the following paragraphs.

The change log consolidation algorithm implemented in the reduce method 
[2] is based on algebraic properties of move operations on paths. The 
other operations (add node, remove node and set property) are 
generalized to move operations. Consolidation relies on reduction and 
commutation rules of move operations.

A move operation resembles a map on a hierarchy (of nodes and 
properties). A change log consisting of k move operations m_1 to m_k is 
thus the composition of the individual moves: m_1 *, ..., * m_k (using * 
for function composition: f(g(x)) = (f * g)(x)).

Some definitions, notation and propositions:

* Let NIL denote a path which never occurs in any hierarchy.

* Order on paths: let p, q be paths.
   - p < q iff p != NIL and q != NIL and p is an ancestor of q.
   - p <= q iff p < q or p == q != NIL

* Conflict of paths: let p, q be paths.
   - p ~ q (p conflicts with q) iff either p <= q or q <= p

* Substitution in paths: let p, q, r be paths.
   - [p -> q]r = r if p is not an ancestor of r and
   - [p -> q]r = s where s is the path resulting from replacing
     the ancestor p in r with q otherwise.

* Let p, q be paths. Then p:q denotes a move operation where the
   node at p is moved to a new node at q.

* Valid moves: leq p, q be paths.
   - p:q is valid iff p !~ q or p = q
   - if p:q is not valid, it is invalid

Invalid moves are exactly those which would result in a node being moved 
to an ancestor/descendant of itself.

* Identity on moves: let p, q be paths.
   - p:q = ID iff p = q.

* Conflict on moves: let p, q, r, s be paths.
   - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
     or q ~ r or q ~ s.

* Strict commutativity of moves: let m, n be moves.
   - m * n = n * m iff m !~ n

* Substitutions in moves: let p, q, r, s be paths.
   - [p -> q]r:s = [p -> q]r:[p -> q]s

* Let p be a path and let +p denote an add node operation and let -p
   denote a remove node operation for a node at path p.
   - +p = NIL:p That is, adding a node is represented by a move from a
     unknown source.
   - p = p:NIL. That is, removing a node is represented by a move to an
     unknown sink.


Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID 
and n != ID. Then the following reduction and commutation rules apply:

1.  p!~ r:  m * n = n * m
2.  p < r:  illegal (since this implies q <= r which implies p ~ q and
             thus m invalid)
3.  p = r:  illegal (since this implies q <= r which implies p ~ q and
             this m invalid)
4.  p > r:  does not commute if q < s. Otherwise m * n = n * [r -> s]m
5.  p!~ s:  m * n = n * m
6.  p < s:  illegal (since this implies p ~ q and thus m invalid)
7.  p = s:  does not commute
8.  p > s:  illegal (since p > s implies there is an s already which
             will conflict with r:s)
9.  q!~ r:  m * n = n * m
10. q < r:  m * n = [q -> p]n * m
11. q = r:  m * n = p:s (transitivity of moves)
12. q > r:  m * n = n * [r -> s]m
13. q!~ s:  m * n = n * m
14. q < s:  does not commute if p > r. Otherwise m * n = [q -> p]n * m
15. q = s:  illegal (since s conflicts with r:s)
16. q > s:  illegal (since s conflicts with r:s)

Allowing add node and remove node operations the following additional 
conditions apply:

Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the 
reduction and commutations rules 1. to 16. apply with extra conditions 
on 4., 10., 12. and 14.:

4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
      m, n do not commute.
10'. illegal if p = NIL
12'. if s = NIL then m * n = -r * -p
14'. illegal if p = NIL

Michael


[1] http://markmail.org/message/qkkcvtmtapas2cx4
[2] 
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
[3] 
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.

>>>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID
>>>> and
>>>> n != ID. Then the following reduction and commutation rules apply:
>>>>
>>>> 1.  p!~ r:  m * n = n * m
>>>> 2.  p<    r:  illegal (since this implies q<= r which implies p ~ q and
>>>>             thus m invalid)
>>>> 3.  p = r:  illegal (since this implies q<= r which implies p ~ q and
>>>>             this m invalid)
>>>> 4.  p>    r:  does not commute if q<    s. Otherwise m * n = n * [r ->    s]m
>>>> 5.  p!~ s:  m * n = n * m
>>>> 6.  p<    s:  illegal (since this implies p ~ q and thus m invalid)
>>>> 7.  p = s:  does not commute
>>>> 8.  p>    s:  illegal (since p>    s implies there is an s already which
>>>>             will conflict with r:s)
>>>> 9.  q!~ r:  m * n = n * m
>>>> 10. q<    r:  m * n = [q ->    p]n * m
>>>> 11. q = r:  m * n = p:s (transitivity of moves)
>>>> 12. q>    r:  m * n = n * [r ->    s]m
>>>> 13. q!~ s:  m * n = n * m
>>>> 14. q<    s:  does not commute if p>    r. Otherwise m * n = [q ->    p]n * m
>>>> 15. q = s:  illegal (since s conflicts with r:s)
>>>> 16. q>    s:  illegal (since s conflicts with r:s)
>>>>
>>>> Allowing add node and remove node operations the following additional
>>>> conditions apply:
>>>>
>>>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
>>>> reduction and commutations rules 1. to 16. apply with extra conditions on
>>>> 4., 10., 12. and 14.:
>>>>
>>>> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>>>>      m, n do not commute.
>>>> 10'. illegal if p = NIL
>>>> 12'. if s = NIL then m * n = -r * -p
>>>> 14'. illegal if p = NIL
>>>
>>
>>> could you please provide a simple example?
>>

Here are simple examples for each of the above cases and its respective 
special cases. Note that I refined 14'. to

14'. if p = NIL: does not commute if the parent of s is q. Illegal otherwise

1.   >/a:/b >/c:/d     =  >/c:/d >/a:b
2.   >/a:/b >/a/b:/c      illegal
3.   >/a:/b >/a:/c        illegal
4.   >/a/b:/c >/a:/d   =  >/a:/d >/d/b:/c
4.   >/a/b:/c >/a:/c/d    does not commute  (q < s)
4'.  -/a/b -/a         =  -/a               (s = NIL and q = NIL)
4'.  >/a/b:/c -/a      =  does not commute  (s = NIL)
5.   >/a:/b >/c:/d     =  >/c:/d >/a:b
6.   >/a:/b >/c:/a/d      illegal
7.   >/a:/b >/c:/a        does not commute
8.   >/a/d:/b >/c:/a      illegal
9.   >/a:/b >/c:/d     =  >/c:/d >/a:b
10.  >/a:/b >/b/c:/d   =  >/a/c:/d >/a:/b
10'. +/b:{} >/b/c:/       illegal
11.  >/a:/b >/b:/c     =  >/a:/c
12.  >/a:/b/c >/b:/d   =  >/b:/d >/a:/d/c
12'. >/a:/b/c -/b      =  -/b -/a
13:  >/a:/b >/c:/d     =  >/c:/d >/a:b
14.  >/a:/b >/c:/b/d   =  >/c:/a/d >/a:/b
14.  >/a/b:/b >/a:/b/d    does not commute  (p > r)
14'. +/b:{} >/c:/b/c      does not commute  (parent of s = q and p = NIL)
14'. +/b:{} >/c:/b/c/d    illegal           (p = NIL)
15.  >/a:/b >/c:/b        illegal
16.  >/a:/b/d >/c:/b      illegal

Have fun ;-)

Michael

Re: [jr3 microkernel] Change log consolidation

Posted by Stefan Guggisberg <st...@gmail.com>.

On Mon, Feb 6, 2012 at 3:23 PM, Michael Dürig <md...@apache.org> wrote:
>
>
>>>
>>> - Much simpler.
>>
>>
>> erm, i beg to differ ... ;)
>
>
> So do I ;-)
>
>
>>> Apart from the benefits, it is a plain necessity for transient space
>>> implementations on top of the Microkernel: without proper consolidation
>>> users could end up not being able to save their (valid) transient
>>> modifications. Consider a first user doing
>>>
>>>> /a:/t/x +/t/x/y:{}>/t/x:/b
>>>
>>>
>>> if in the meanwhile another user removes x the save operation for the
>>> first
>>> user would fail.
>>
>>
>> how could another user remove /t/x? it only exists in the first user's
>> transient space.
>
>
> Sorry typo. My last sentence should read: "if in the meanwhile another user
> removes *+/t* the save operation for the first user would fail.

ok, that's a valid point.

cheers
stefan

>
> Michael
>
>
>>
>> cheers
>> stefan
>>
>>>
>>> After consolidation, above change log would look like
>>>
>>> +/a/y:{}>/a:/b
>>>
>>> and can be saved.
>>>
>>> Michael
>>>
>>>
>>>>
>>>> cheers
>>>> stefan
>>>>
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>>>>> [2]
>>>>>
>>>>>
>>>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>>>>> [3]
>>>>>
>>>>>
>>>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.


>>
>> - Much simpler.
>
> erm, i beg to differ ... ;)

So do I ;-)

>> Apart from the benefits, it is a plain necessity for transient space
>> implementations on top of the Microkernel: without proper consolidation
>> users could end up not being able to save their (valid) transient
>> modifications. Consider a first user doing
>>
>>> /a:/t/x +/t/x/y:{}>/t/x:/b
>>
>> if in the meanwhile another user removes x the save operation for the first
>> user would fail.
>
> how could another user remove /t/x? it only exists in the first user's
> transient space.

Sorry typo. My last sentence should read: "if in the meanwhile another 
user removes *+/t* the save operation for the first user would fail.

Michael

>
> cheers
> stefan
>
>>
>> After consolidation, above change log would look like
>>
>> +/a/y:{}>/a:/b
>>
>> and can be saved.
>>
>> Michael
>>
>>
>>>
>>> cheers
>>> stefan
>>>
>>>>
>>>> Michael
>>>>
>>>>
>>>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>>>> [2]
>>>>
>>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>>>> [3]
>>>>
>>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Re: [jr3 microkernel] Change log consolidation

Posted by Stefan Guggisberg <st...@gmail.com>.

On Sun, Feb 5, 2012 at 6:25 PM, Michael Dürig <md...@apache.org> wrote:
>
>
> On 4.2.12 20:07, Stefan Guggisberg wrote:
>>
>> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<md...@apache.org>  wrote:
>>>
>>>
>>> Hi,
>>>
>>> While working on a new transient space implementation [1] I discovered
>>> that
>>> recreating a list of changes from transient modifications to the
>>> hierarchy
>>> is like opening a can of worms. Rethinking things it turned out, that it
>>> is
>>> much simpler to directly consolidate the change log. This approach is
>>> much
>>> more flexible and versatile since knowledge of the persisted hierarchy is
>>> not necessary at all. Thus this approach should work equally well with
>>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>>>
>>> The algorithm is capable of creating minimal change logs from any valid
>>> change log (that is from any change log that can be applied to some
>>> hierarchy).
>>>
>>> I extensively tested the algorithm on randomly created change logs [3].
>>> This
>>> resulted in some interesting numbers: consolidation results in an
>>> decrease
>>> of about 45% of these randomly generated change logs. Furthermore the
>>> resulting size of the H2 database files *decreased up to 200%*!
>>>
>>> The implementation [2] is currently part of the jackrabbit-microkernel
>>> module in the sandbox. However it does not have any crucial dependencies
>>> so
>>> it should be easy to adapt for other usages.
>>>
>>> The gory details of the algorithm are based on some algebraic properties
>>> of
>>> move operations on paths as detailed in the following paragraphs.
>>>
>>> The change log consolidation algorithm implemented in the reduce method
>>> [2]
>>> is based on algebraic properties of move operations on paths. The other
>>> operations (add node, remove node and set property) are generalized to
>>> move
>>> operations. Consolidation relies on reduction and commutation rules of
>>> move
>>> operations.
>>>
>>> A move operation resembles a map on a hierarchy (of nodes and
>>> properties). A
>>> change log consisting of k move operations m_1 to m_k is thus the
>>> composition of the individual moves: m_1 *, ..., * m_k (using * for
>>> function
>>> composition: f(g(x)) = (f * g)(x)).
>>>
>>> Some definitions, notation and propositions:
>>>
>>> * Let NIL denote a path which never occurs in any hierarchy.
>>>
>>> * Order on paths: let p, q be paths.
>>>  - p<  q iff p != NIL and q != NIL and p is an ancestor of q.
>>>  - p<= q iff p<  q or p == q != NIL
>>>
>>> * Conflict of paths: let p, q be paths.
>>>  - p ~ q (p conflicts with q) iff either p<= q or q<= p
>>>
>>> * Substitution in paths: let p, q, r be paths.
>>>  - [p ->  q]r = r if p is not an ancestor of r and
>>>  - [p ->  q]r = s where s is the path resulting from replacing
>>>    the ancestor p in r with q otherwise.
>>>
>>> * Let p, q be paths. Then p:q denotes a move operation where the
>>>  node at p is moved to a new node at q.
>>>
>>> * Valid moves: leq p, q be paths.
>>>  - p:q is valid iff p !~ q or p = q
>>>  - if p:q is not valid, it is invalid
>>>
>>> Invalid moves are exactly those which would result in a node being moved
>>> to
>>> an ancestor/descendant of itself.
>>>
>>> * Identity on moves: let p, q be paths.
>>>  - p:q = ID iff p = q.
>>>
>>> * Conflict on moves: let p, q, r, s be paths.
>>>  - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>>>    or q ~ r or q ~ s.
>>>
>>> * Strict commutativity of moves: let m, n be moves.
>>>  - m * n = n * m iff m !~ n
>>>
>>> * Substitutions in moves: let p, q, r, s be paths.
>>>  - [p ->  q]r:s = [p ->  q]r:[p ->  q]s
>>>
>>> * Let p be a path and let +p denote an add node operation and let -p
>>>  denote a remove node operation for a node at path p.
>>>  - +p = NIL:p That is, adding a node is represented by a move from a
>>>    unknown source.
>>>  - p = p:NIL. That is, removing a node is represented by a move to an
>>>    unknown sink.
>>>
>>>
>>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID
>>> and
>>> n != ID. Then the following reduction and commutation rules apply:
>>>
>>> 1.  p!~ r:  m * n = n * m
>>> 2.  p<  r:  illegal (since this implies q<= r which implies p ~ q and
>>>            thus m invalid)
>>> 3.  p = r:  illegal (since this implies q<= r which implies p ~ q and
>>>            this m invalid)
>>> 4.  p>  r:  does not commute if q<  s. Otherwise m * n = n * [r ->  s]m
>>> 5.  p!~ s:  m * n = n * m
>>> 6.  p<  s:  illegal (since this implies p ~ q and thus m invalid)
>>> 7.  p = s:  does not commute
>>> 8.  p>  s:  illegal (since p>  s implies there is an s already which
>>>            will conflict with r:s)
>>> 9.  q!~ r:  m * n = n * m
>>> 10. q<  r:  m * n = [q ->  p]n * m
>>> 11. q = r:  m * n = p:s (transitivity of moves)
>>> 12. q>  r:  m * n = n * [r ->  s]m
>>> 13. q!~ s:  m * n = n * m
>>> 14. q<  s:  does not commute if p>  r. Otherwise m * n = [q ->  p]n * m
>>> 15. q = s:  illegal (since s conflicts with r:s)
>>> 16. q>  s:  illegal (since s conflicts with r:s)
>>>
>>> Allowing add node and remove node operations the following additional
>>> conditions apply:
>>>
>>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
>>> reduction and commutations rules 1. to 16. apply with extra conditions on
>>> 4., 10., 12. and 14.:
>>>
>>> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>>>     m, n do not commute.
>>> 10'. illegal if p = NIL
>>> 12'. if s = NIL then m * n = -r * -p
>>> 14'. illegal if p = NIL
>>
>>
>> sounds very complicated ;) i admit that i don't
>> fully understand it (ok, i didn't try very hard
>> and gave up somewhere in the middle).
>
>
> It is actually much simpler than it looks. Have a look at the code and the
> inline comments for additional hints and my explanations below.
>
>
>> could you please provide a simple example?
>
>
> What I basically did is identify all cases of how moves, adds and removes
> interact. This turned out to be much simpler when conceptually handling adds
> as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and
> then specialising interaction rules for adds and deletes from there.
>
> One kind of interaction is reductions like
>
>>/a:/b, >/b:/c = >/a:/c
> or
> -/a, -/a/b = -/a
> ...
>
> The other kind of interaction is commutation like
>
>>/a:/b, >/c:/d = >/c:/d, >/a:/b
> or
>> /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x
> ...
>
> With a complete list of such interactions (see 1. - 16. in my initial post)
> it is possible to reduce a change log to its minimal equivalent by
> subsequent application of reductions and commutations: Given a reduced
> change list and a new operation inserted at index k, try to reduce it with
> the operation at k - 1. If this fails commute it with the operation at k - 1
> and repeat at k - 1 until either a reduction has been found or commutation
> fails. In the latter case do the same going forward (to k + 1). If still no
> reduction could be found, we are done and the change list is minimized.
> Otherwise, recursively apply the reduction algorithm treating the result of
> the reduction as the new operation inserted into the rest of the change
> list.
> This is exactly what is implemented in reduce(List<Operation>, int index).
> See
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>
> Example:
>>/a:/x +/x/y:{} >/x:/b =  (commutation of 1 and 2, rule 14.)
> +/a/y:{} >/a:/x >/x:/b =  (reduction of 2 and 3, rule 11.)
> +/a/y:{} >/a:/b
>
> Some key observations which make reduce() work:
> - The interaction rules do not depend on the persisted hierarchy: If a
> change log can be applied to a hierarchy, its minimized equivalent can be
> applied to the same hierarchy. More over the observable effect on that
> hierarchy will be the same.
>
> - The reduce algorithm is minimizing since whether a reduction can occur or
> not does not depend on 'where it occurs'. Above example could as well reduce
> as follows:
>
>>/a:/x +/x/y:{} >/x:/b =  (commutation of 2 and 3, rule 12.)
>>/a:/x >/x:/b +/b/y =     (reduction of 1 and 2, rule 11.)
>>/a:/b +/b/y:{}
>
>
>> what is the benefit of this approach compared
>> to the imlementation in jackrabbit-core
>> (which does consolidate transient changes)?
>
>
> - Versatility: it is not bound to any implementation, nor does it need the
> actual hierarchy (i.e. it is stateless). It is basically a realisation of a
> map from change logs to change logs.

ok, agreed.

>
> - Self contained: everything in one single quite small class. No
> dependencies, no clutter.
>
> - Much simpler.

erm, i beg to differ ... ;)

>
> - Provably sound and complete.
>
> - Provably minimizing on the change logs.
>
> - Portable since it doesn't depend on a specific underlying implementation.
>
> - Reduction of up to 45% of the change log communicated from the upper
> layers to the Microkernel.

great! however, to be fair, you're comparing 2 different change log-based
transient space implementations in the sandbox.

transient space implementations don't necessarily need to be change log-based
(see e.g. jackrabbit-core). i assume you would see little (if not no) reduction
in the number of persisted changes per save operation when comparing to the
jackrabbit-core transient space implementation. OTOH, i am not saying
that the jackrabbit-core approach is suitable when using a microkernel-based
backend and i am not questioning the usefulness of consolidated change logs
in general.

>
> - In the case of H2, reduction of up to 200% of database size.

see my previous comment

>
> Apart from the benefits, it is a plain necessity for transient space
> implementations on top of the Microkernel: without proper consolidation
> users could end up not being able to save their (valid) transient
> modifications. Consider a first user doing
>
>>/a:/t/x +/t/x/y:{} >/t/x:/b
>
> if in the meanwhile another user removes x the save operation for the first
> user would fail.

how could another user remove /t/x? it only exists in the first user's
transient space.

cheers
stefan

>
> After consolidation, above change log would look like
>
> +/a/y:{} >/a:/b
>
> and can be saved.
>
> Michael
>
>
>>
>> cheers
>> stefan
>>
>>>
>>> Michael
>>>
>>>
>>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>>> [2]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>>> [3]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.


On 4.2.12 20:07, Stefan Guggisberg wrote:
> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<md...@apache.org>  wrote:
>>
>> Hi,
>>
>> While working on a new transient space implementation [1] I discovered that
>> recreating a list of changes from transient modifications to the hierarchy
>> is like opening a can of worms. Rethinking things it turned out, that it is
>> much simpler to directly consolidate the change log. This approach is much
>> more flexible and versatile since knowledge of the persisted hierarchy is
>> not necessary at all. Thus this approach should work equally well with
>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>>
>> The algorithm is capable of creating minimal change logs from any valid
>> change log (that is from any change log that can be applied to some
>> hierarchy).
>>
>> I extensively tested the algorithm on randomly created change logs [3]. This
>> resulted in some interesting numbers: consolidation results in an decrease
>> of about 45% of these randomly generated change logs. Furthermore the
>> resulting size of the H2 database files *decreased up to 200%*!
>>
>> The implementation [2] is currently part of the jackrabbit-microkernel
>> module in the sandbox. However it does not have any crucial dependencies so
>> it should be easy to adapt for other usages.
>>
>> The gory details of the algorithm are based on some algebraic properties of
>> move operations on paths as detailed in the following paragraphs.
>>
>> The change log consolidation algorithm implemented in the reduce method [2]
>> is based on algebraic properties of move operations on paths. The other
>> operations (add node, remove node and set property) are generalized to move
>> operations. Consolidation relies on reduction and commutation rules of move
>> operations.
>>
>> A move operation resembles a map on a hierarchy (of nodes and properties). A
>> change log consisting of k move operations m_1 to m_k is thus the
>> composition of the individual moves: m_1 *, ..., * m_k (using * for function
>> composition: f(g(x)) = (f * g)(x)).
>>
>> Some definitions, notation and propositions:
>>
>> * Let NIL denote a path which never occurs in any hierarchy.
>>
>> * Order on paths: let p, q be paths.
>>   - p<  q iff p != NIL and q != NIL and p is an ancestor of q.
>>   - p<= q iff p<  q or p == q != NIL
>>
>> * Conflict of paths: let p, q be paths.
>>   - p ~ q (p conflicts with q) iff either p<= q or q<= p
>>
>> * Substitution in paths: let p, q, r be paths.
>>   - [p ->  q]r = r if p is not an ancestor of r and
>>   - [p ->  q]r = s where s is the path resulting from replacing
>>     the ancestor p in r with q otherwise.
>>
>> * Let p, q be paths. Then p:q denotes a move operation where the
>>   node at p is moved to a new node at q.
>>
>> * Valid moves: leq p, q be paths.
>>   - p:q is valid iff p !~ q or p = q
>>   - if p:q is not valid, it is invalid
>>
>> Invalid moves are exactly those which would result in a node being moved to
>> an ancestor/descendant of itself.
>>
>> * Identity on moves: let p, q be paths.
>>   - p:q = ID iff p = q.
>>
>> * Conflict on moves: let p, q, r, s be paths.
>>   - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>>     or q ~ r or q ~ s.
>>
>> * Strict commutativity of moves: let m, n be moves.
>>   - m * n = n * m iff m !~ n
>>
>> * Substitutions in moves: let p, q, r, s be paths.
>>   - [p ->  q]r:s = [p ->  q]r:[p ->  q]s
>>
>> * Let p be a path and let +p denote an add node operation and let -p
>>   denote a remove node operation for a node at path p.
>>   - +p = NIL:p That is, adding a node is represented by a move from a
>>     unknown source.
>>   - p = p:NIL. That is, removing a node is represented by a move to an
>>     unknown sink.
>>
>>
>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID and
>> n != ID. Then the following reduction and commutation rules apply:
>>
>> 1.  p!~ r:  m * n = n * m
>> 2.  p<  r:  illegal (since this implies q<= r which implies p ~ q and
>>             thus m invalid)
>> 3.  p = r:  illegal (since this implies q<= r which implies p ~ q and
>>             this m invalid)
>> 4.  p>  r:  does not commute if q<  s. Otherwise m * n = n * [r ->  s]m
>> 5.  p!~ s:  m * n = n * m
>> 6.  p<  s:  illegal (since this implies p ~ q and thus m invalid)
>> 7.  p = s:  does not commute
>> 8.  p>  s:  illegal (since p>  s implies there is an s already which
>>             will conflict with r:s)
>> 9.  q!~ r:  m * n = n * m
>> 10. q<  r:  m * n = [q ->  p]n * m
>> 11. q = r:  m * n = p:s (transitivity of moves)
>> 12. q>  r:  m * n = n * [r ->  s]m
>> 13. q!~ s:  m * n = n * m
>> 14. q<  s:  does not commute if p>  r. Otherwise m * n = [q ->  p]n * m
>> 15. q = s:  illegal (since s conflicts with r:s)
>> 16. q>  s:  illegal (since s conflicts with r:s)
>>
>> Allowing add node and remove node operations the following additional
>> conditions apply:
>>
>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
>> reduction and commutations rules 1. to 16. apply with extra conditions on
>> 4., 10., 12. and 14.:
>>
>> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>>      m, n do not commute.
>> 10'. illegal if p = NIL
>> 12'. if s = NIL then m * n = -r * -p
>> 14'. illegal if p = NIL
>
> sounds very complicated ;) i admit that i don't
> fully understand it (ok, i didn't try very hard
> and gave up somewhere in the middle).

It is actually much simpler than it looks. Have a look at the code and 
the inline comments for additional hints and my explanations below.

> could you please provide a simple example?

What I basically did is identify all cases of how moves, adds and 
removes interact. This turned out to be much simpler when conceptually 
handling adds as moves from 'nowhere' and deletes as moves to 'nowhere' 
(aka Trash) and then specialising interaction rules for adds and deletes 
from there.

One kind of interaction is reductions like

 >/a:/b, >/b:/c = >/a:/c
or
-/a, -/a/b = -/a
...

The other kind of interaction is commutation like

 >/a:/b, >/c:/d = >/c:/d, >/a:/b
or
 > /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x
...

With a complete list of such interactions (see 1. - 16. in my initial 
post) it is possible to reduce a change log to its minimal equivalent by 
subsequent application of reductions and commutations: Given a reduced 
change list and a new operation inserted at index k, try to reduce it 
with the operation at k - 1. If this fails commute it with the operation 
at k - 1 and repeat at k - 1 until either a reduction has been found or 
commutation fails. In the latter case do the same going forward (to k + 
1). If still no reduction could be found, we are done and the change 
list is minimized. Otherwise, recursively apply the reduction algorithm 
treating the result of the reduction as the new operation inserted into 
the rest of the change list.
This is exactly what is implemented in reduce(List<Operation>, int 
index). See 
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup

Example:
 >/a:/x +/x/y:{} >/x:/b =  (commutation of 1 and 2, rule 14.)
+/a/y:{} >/a:/x >/x:/b =  (reduction of 2 and 3, rule 11.)
+/a/y:{} >/a:/b

Some key observations which make reduce() work:
- The interaction rules do not depend on the persisted hierarchy: If a 
change log can be applied to a hierarchy, its minimized equivalent can 
be applied to the same hierarchy. More over the observable effect on 
that hierarchy will be the same.

- The reduce algorithm is minimizing since whether a reduction can occur 
or not does not depend on 'where it occurs'. Above example could as well 
reduce as follows:

 >/a:/x +/x/y:{} >/x:/b =  (commutation of 2 and 3, rule 12.)
 >/a:/x >/x:/b +/b/y =     (reduction of 1 and 2, rule 11.)
 >/a:/b +/b/y:{}

> what is the benefit of this approach compared
> to the imlementation in jackrabbit-core
> (which does consolidate transient changes)?

- Versatility: it is not bound to any implementation, nor does it need 
the actual hierarchy (i.e. it is stateless). It is basically a 
realisation of a map from change logs to change logs.

- Self contained: everything in one single quite small class. No 
dependencies, no clutter.

- Much simpler.

- Provably sound and complete.

- Provably minimizing on the change logs.

- Portable since it doesn't depend on a specific underlying implementation.

- Reduction of up to 45% of the change log communicated from the upper 
layers to the Microkernel.

- In the case of H2, reduction of up to 200% of database size.

Apart from the benefits, it is a plain necessity for transient space 
implementations on top of the Microkernel: without proper consolidation 
users could end up not being able to save their (valid) transient 
modifications. Consider a first user doing

 >/a:/t/x +/t/x/y:{} >/t/x:/b

if in the meanwhile another user removes x the save operation for the 
first user would fail.

After consolidation, above change log would look like

+/a/y:{} >/a:/b

and can be saved.

Michael

>
> cheers
> stefan
>
>>
>> Michael
>>
>>
>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>> [2]
>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>> [3]
>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Re: [jr3 microkernel] Change log consolidation

Posted by Stefan Guggisberg <st...@gmail.com>.

On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig <md...@apache.org> wrote:
>
> Hi,
>
> While working on a new transient space implementation [1] I discovered that
> recreating a list of changes from transient modifications to the hierarchy
> is like opening a can of worms. Rethinking things it turned out, that it is
> much simpler to directly consolidate the change log. This approach is much
> more flexible and versatile since knowledge of the persisted hierarchy is
> not necessary at all. Thus this approach should work equally well with
> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>
> The algorithm is capable of creating minimal change logs from any valid
> change log (that is from any change log that can be applied to some
> hierarchy).
>
> I extensively tested the algorithm on randomly created change logs [3]. This
> resulted in some interesting numbers: consolidation results in an decrease
> of about 45% of these randomly generated change logs. Furthermore the
> resulting size of the H2 database files *decreased up to 200%*!
>
> The implementation [2] is currently part of the jackrabbit-microkernel
> module in the sandbox. However it does not have any crucial dependencies so
> it should be easy to adapt for other usages.
>
> The gory details of the algorithm are based on some algebraic properties of
> move operations on paths as detailed in the following paragraphs.
>
> The change log consolidation algorithm implemented in the reduce method [2]
> is based on algebraic properties of move operations on paths. The other
> operations (add node, remove node and set property) are generalized to move
> operations. Consolidation relies on reduction and commutation rules of move
> operations.
>
> A move operation resembles a map on a hierarchy (of nodes and properties). A
> change log consisting of k move operations m_1 to m_k is thus the
> composition of the individual moves: m_1 *, ..., * m_k (using * for function
> composition: f(g(x)) = (f * g)(x)).
>
> Some definitions, notation and propositions:
>
> * Let NIL denote a path which never occurs in any hierarchy.
>
> * Order on paths: let p, q be paths.
>  - p < q iff p != NIL and q != NIL and p is an ancestor of q.
>  - p <= q iff p < q or p == q != NIL
>
> * Conflict of paths: let p, q be paths.
>  - p ~ q (p conflicts with q) iff either p <= q or q <= p
>
> * Substitution in paths: let p, q, r be paths.
>  - [p -> q]r = r if p is not an ancestor of r and
>  - [p -> q]r = s where s is the path resulting from replacing
>    the ancestor p in r with q otherwise.
>
> * Let p, q be paths. Then p:q denotes a move operation where the
>  node at p is moved to a new node at q.
>
> * Valid moves: leq p, q be paths.
>  - p:q is valid iff p !~ q or p = q
>  - if p:q is not valid, it is invalid
>
> Invalid moves are exactly those which would result in a node being moved to
> an ancestor/descendant of itself.
>
> * Identity on moves: let p, q be paths.
>  - p:q = ID iff p = q.
>
> * Conflict on moves: let p, q, r, s be paths.
>  - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>    or q ~ r or q ~ s.
>
> * Strict commutativity of moves: let m, n be moves.
>  - m * n = n * m iff m !~ n
>
> * Substitutions in moves: let p, q, r, s be paths.
>  - [p -> q]r:s = [p -> q]r:[p -> q]s
>
> * Let p be a path and let +p denote an add node operation and let -p
>  denote a remove node operation for a node at path p.
>  - +p = NIL:p That is, adding a node is represented by a move from a
>    unknown source.
>  - p = p:NIL. That is, removing a node is represented by a move to an
>    unknown sink.
>
>
> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID and
> n != ID. Then the following reduction and commutation rules apply:
>
> 1.  p!~ r:  m * n = n * m
> 2.  p < r:  illegal (since this implies q <= r which implies p ~ q and
>            thus m invalid)
> 3.  p = r:  illegal (since this implies q <= r which implies p ~ q and
>            this m invalid)
> 4.  p > r:  does not commute if q < s. Otherwise m * n = n * [r -> s]m
> 5.  p!~ s:  m * n = n * m
> 6.  p < s:  illegal (since this implies p ~ q and thus m invalid)
> 7.  p = s:  does not commute
> 8.  p > s:  illegal (since p > s implies there is an s already which
>            will conflict with r:s)
> 9.  q!~ r:  m * n = n * m
> 10. q < r:  m * n = [q -> p]n * m
> 11. q = r:  m * n = p:s (transitivity of moves)
> 12. q > r:  m * n = n * [r -> s]m
> 13. q!~ s:  m * n = n * m
> 14. q < s:  does not commute if p > r. Otherwise m * n = [q -> p]n * m
> 15. q = s:  illegal (since s conflicts with r:s)
> 16. q > s:  illegal (since s conflicts with r:s)
>
> Allowing add node and remove node operations the following additional
> conditions apply:
>
> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
> reduction and commutations rules 1. to 16. apply with extra conditions on
> 4., 10., 12. and 14.:
>
> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>     m, n do not commute.
> 10'. illegal if p = NIL
> 12'. if s = NIL then m * n = -r * -p
> 14'. illegal if p = NIL

sounds very complicated ;) i admit that i don't
fully understand it (ok, i didn't try very hard
and gave up somewhere in the middle).

could you please provide a simple example?

what is the benefit of this approach compared
to the imlementation in jackrabbit-core
(which does consolidate transient changes)?

cheers
stefan

>
> Michael
>
>
> [1] http://markmail.org/message/qkkcvtmtapas2cx4
> [2]
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
> [3]
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Re: [jr3 microkernel] Change log consolidation

Posted by Julian Reschke <ju...@gmx.de>.

On 2012-02-03 19:44, Michael Dürig wrote:
> ...

Sounds very interesting.

I have a comment that may not even apply to this work, but still: 
recently a JCR issue was opened because Jackrabbit doesn't fire 
compliant NODE_MOVE events for reorder operations. Turns out that 
Jackrabbit doesn't keep sufficient information, and then tries to guess, 
and sometimes guesses wrong.

We should make a conscious decision whether it's important to have this 
information in observation events. Based on that we may want to tune the 
JCR 2.1 spec.

Best regards, Julian

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.

Hi,

Coming back to this after 5 years, yay!

Turns out that mentioned approach is nowadays called "patch commutation" 
[1, 2] if we regard change logs as patches, that is. It is used as 
foundation of the darcs version control system [3].

The way Oak rebases branches on commit [4] is actually the same thing in 
guise: commutation of patches. Only that Oak does not consider the 
commutative properties of move operations, which the original proposal 
below did.

The reason I'm bringing this up now is that this topic might turn out 
helpful again once we start talking about conflict reconciliation under 
network partition (e.g. geo-replication). As Alexander Klimetschek 
already pointed out in the discussions 5 years back BTW!

Michael
	

[1] https://en.wikibooks.org/wiki/Understanding_Darcs/Patch_theory
[2] 
https://byorgey.wordpress.com/2008/02/13/patch-theory-part-ii-some-basics/
[3] http://darcs.net/
[4] 
https://wiki.apache.org/jackrabbit/Conflict%20handling%20through%20rebasing%20branches


On 3.2.12 7:44 , Michael Drig wrote:
>
> Hi,
>
> While working on a new transient space implementation [1] I discovered
> that recreating a list of changes from transient modifications to the
> hierarchy is like opening a can of worms. Rethinking things it turned
> out, that it is much simpler to directly consolidate the change log.
> This approach is much more flexible and versatile since knowledge of the
> persisted hierarchy is not necessary at all. Thus this approach should
> work equally well with change logs for jr3, the Microkernel, Jackrabbit
> core, JCR and SPI.
>
> The algorithm is capable of creating minimal change logs from any valid
> change log (that is from any change log that can be applied to some
> hierarchy).
>
> I extensively tested the algorithm on randomly created change logs [3].
> This resulted in some interesting numbers: consolidation results in an
> decrease of about 45% of these randomly generated change logs.
> Furthermore the resulting size of the H2 database files *decreased up to
> 200%*!
>
> The implementation [2] is currently part of the jackrabbit-microkernel
> module in the sandbox. However it does not have any crucial dependencies
> so it should be easy to adapt for other usages.
>
> The gory details of the algorithm are based on some algebraic properties
> of move operations on paths as detailed in the following paragraphs.
>
> The change log consolidation algorithm implemented in the reduce method
> [2] is based on algebraic properties of move operations on paths. The
> other operations (add node, remove node and set property) are
> generalized to move operations. Consolidation relies on reduction and
> commutation rules of move operations.
>
> A move operation resembles a map on a hierarchy (of nodes and
> properties). A change log consisting of k move operations m_1 to m_k is
> thus the composition of the individual moves: m_1 *, ..., * m_k (using *
> for function composition: f(g(x)) = (f * g)(x)).
>
> Some definitions, notation and propositions:
>
> * Let NIL denote a path which never occurs in any hierarchy.
>
> * Order on paths: let p, q be paths.
>   - p < q iff p != NIL and q != NIL and p is an ancestor of q.
>   - p <= q iff p < q or p == q != NIL
>
> * Conflict of paths: let p, q be paths.
>   - p ~ q (p conflicts with q) iff either p <= q or q <= p
>
> * Substitution in paths: let p, q, r be paths.
>   - [p -> q]r = r if p is not an ancestor of r and
>   - [p -> q]r = s where s is the path resulting from replacing
>     the ancestor p in r with q otherwise.
>
> * Let p, q be paths. Then p:q denotes a move operation where the
>   node at p is moved to a new node at q.
>
> * Valid moves: leq p, q be paths.
>   - p:q is valid iff p !~ q or p = q
>   - if p:q is not valid, it is invalid
>
> Invalid moves are exactly those which would result in a node being moved
> to an ancestor/descendant of itself.
>
> * Identity on moves: let p, q be paths.
>   - p:q = ID iff p = q.
>
> * Conflict on moves: let p, q, r, s be paths.
>   - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>     or q ~ r or q ~ s.
>
> * Strict commutativity of moves: let m, n be moves.
>   - m * n = n * m iff m !~ n
>
> * Substitutions in moves: let p, q, r, s be paths.
>   - [p -> q]r:s = [p -> q]r:[p -> q]s
>
> * Let p be a path and let +p denote an add node operation and let -p
>   denote a remove node operation for a node at path p.
>   - +p = NIL:p That is, adding a node is represented by a move from a
>     unknown source.
>   - p = p:NIL. That is, removing a node is represented by a move to an
>     unknown sink.
>
>
> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID
> and n != ID. Then the following reduction and commutation rules apply:
>
> 1.  p!~ r:  m * n = n * m
> 2.  p < r:  illegal (since this implies q <= r which implies p ~ q and
>             thus m invalid)
> 3.  p = r:  illegal (since this implies q <= r which implies p ~ q and
>             this m invalid)
> 4.  p > r:  does not commute if q < s. Otherwise m * n = n * [r -> s]m
> 5.  p!~ s:  m * n = n * m
> 6.  p < s:  illegal (since this implies p ~ q and thus m invalid)
> 7.  p = s:  does not commute
> 8.  p > s:  illegal (since p > s implies there is an s already which
>             will conflict with r:s)
> 9.  q!~ r:  m * n = n * m
> 10. q < r:  m * n = [q -> p]n * m
> 11. q = r:  m * n = p:s (transitivity of moves)
> 12. q > r:  m * n = n * [r -> s]m
> 13. q!~ s:  m * n = n * m
> 14. q < s:  does not commute if p > r. Otherwise m * n = [q -> p]n * m
> 15. q = s:  illegal (since s conflicts with r:s)
> 16. q > s:  illegal (since s conflicts with r:s)
>
> Allowing add node and remove node operations the following additional
> conditions apply:
>
> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
> reduction and commutations rules 1. to 16. apply with extra conditions
> on 4., 10., 12. and 14.:
>
> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>      m, n do not commute.
> 10'. illegal if p = NIL
> 12'. if s = NIL then m * n = -r * -p
> 14'. illegal if p = NIL
>
> Michael
>
>
> [1] http://markmail.org/message/qkkcvtmtapas2cx4
> [2]
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>
> [3]
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup
>

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.

Hi,

Coming back to this after 5 years, yay!

Turns out that mentioned approach is nowadays called "patch commutation" 
[1, 2] if we regard change logs as patches, that is. It is used as 
foundation of the darcs version control system [3].

The way Oak rebases branches on commit [4] is actually the same thing in 
guise: commutation of patches. Only that Oak does not consider the 
commutative properties of move operations, which the original proposal 
below did.

The reason I'm bringing this up now is that this topic might turn out 
helpful again once we start talking about conflict reconciliation under 
network partition (e.g. geo-replication). As Alexander Klimetschek 
already pointed out in the discussions 5 years back BTW!

Michael
	

[1] https://en.wikibooks.org/wiki/Understanding_Darcs/Patch_theory
[2] 
https://byorgey.wordpress.com/2008/02/13/patch-theory-part-ii-some-basics/
[3] http://darcs.net/
[4] 
https://wiki.apache.org/jackrabbit/Conflict%20handling%20through%20rebasing%20branches


On 3.2.12 7:44 , Michael Drig wrote:
>
> Hi,
>
> While working on a new transient space implementation [1] I discovered
> that recreating a list of changes from transient modifications to the
> hierarchy is like opening a can of worms. Rethinking things it turned
> out, that it is much simpler to directly consolidate the change log.
> This approach is much more flexible and versatile since knowledge of the
> persisted hierarchy is not necessary at all. Thus this approach should
> work equally well with change logs for jr3, the Microkernel, Jackrabbit
> core, JCR and SPI.
>
> The algorithm is capable of creating minimal change logs from any valid
> change log (that is from any change log that can be applied to some
> hierarchy).
>
> I extensively tested the algorithm on randomly created change logs [3].
> This resulted in some interesting numbers: consolidation results in an
> decrease of about 45% of these randomly generated change logs.
> Furthermore the resulting size of the H2 database files *decreased up to
> 200%*!
>
> The implementation [2] is currently part of the jackrabbit-microkernel
> module in the sandbox. However it does not have any crucial dependencies
> so it should be easy to adapt for other usages.
>
> The gory details of the algorithm are based on some algebraic properties
> of move operations on paths as detailed in the following paragraphs.
>
> The change log consolidation algorithm implemented in the reduce method
> [2] is based on algebraic properties of move operations on paths. The
> other operations (add node, remove node and set property) are
> generalized to move operations. Consolidation relies on reduction and
> commutation rules of move operations.
>
> A move operation resembles a map on a hierarchy (of nodes and
> properties). A change log consisting of k move operations m_1 to m_k is
> thus the composition of the individual moves: m_1 *, ..., * m_k (using *
> for function composition: f(g(x)) = (f * g)(x)).
>
> Some definitions, notation and propositions:
>
> * Let NIL denote a path which never occurs in any hierarchy.
>
> * Order on paths: let p, q be paths.
>   - p < q iff p != NIL and q != NIL and p is an ancestor of q.
>   - p <= q iff p < q or p == q != NIL
>
> * Conflict of paths: let p, q be paths.
>   - p ~ q (p conflicts with q) iff either p <= q or q <= p
>
> * Substitution in paths: let p, q, r be paths.
>   - [p -> q]r = r if p is not an ancestor of r and
>   - [p -> q]r = s where s is the path resulting from replacing
>     the ancestor p in r with q otherwise.
>
> * Let p, q be paths. Then p:q denotes a move operation where the
>   node at p is moved to a new node at q.
>
> * Valid moves: leq p, q be paths.
>   - p:q is valid iff p !~ q or p = q
>   - if p:q is not valid, it is invalid
>
> Invalid moves are exactly those which would result in a node being moved
> to an ancestor/descendant of itself.
>
> * Identity on moves: let p, q be paths.
>   - p:q = ID iff p = q.
>
> * Conflict on moves: let p, q, r, s be paths.
>   - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>     or q ~ r or q ~ s.
>
> * Strict commutativity of moves: let m, n be moves.
>   - m * n = n * m iff m !~ n
>
> * Substitutions in moves: let p, q, r, s be paths.
>   - [p -> q]r:s = [p -> q]r:[p -> q]s
>
> * Let p be a path and let +p denote an add node operation and let -p
>   denote a remove node operation for a node at path p.
>   - +p = NIL:p That is, adding a node is represented by a move from a
>     unknown source.
>   - p = p:NIL. That is, removing a node is represented by a move to an
>     unknown sink.
>
>
> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID
> and n != ID. Then the following reduction and commutation rules apply:
>
> 1.  p!~ r:  m * n = n * m
> 2.  p < r:  illegal (since this implies q <= r which implies p ~ q and
>             thus m invalid)
> 3.  p = r:  illegal (since this implies q <= r which implies p ~ q and
>             this m invalid)
> 4.  p > r:  does not commute if q < s. Otherwise m * n = n * [r -> s]m
> 5.  p!~ s:  m * n = n * m
> 6.  p < s:  illegal (since this implies p ~ q and thus m invalid)
> 7.  p = s:  does not commute
> 8.  p > s:  illegal (since p > s implies there is an s already which
>             will conflict with r:s)
> 9.  q!~ r:  m * n = n * m
> 10. q < r:  m * n = [q -> p]n * m
> 11. q = r:  m * n = p:s (transitivity of moves)
> 12. q > r:  m * n = n * [r -> s]m
> 13. q!~ s:  m * n = n * m
> 14. q < s:  does not commute if p > r. Otherwise m * n = [q -> p]n * m
> 15. q = s:  illegal (since s conflicts with r:s)
> 16. q > s:  illegal (since s conflicts with r:s)
>
> Allowing add node and remove node operations the following additional
> conditions apply:
>
> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
> reduction and commutations rules 1. to 16. apply with extra conditions
> on 4., 10., 12. and 14.:
>
> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>      m, n do not commute.
> 10'. illegal if p = NIL
> 12'. if s = NIL then m * n = -r * -p
> 14'. illegal if p = NIL
>
> Michael
>
>
> [1] http://markmail.org/message/qkkcvtmtapas2cx4
> [2]
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>
> [3]
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup
>

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <mi...@gmail.com>.


>> > I understand why a move to a descendant of self is invalid.
>> >
>> > But why is a move to an ancestor of self invalid ?
>> Because there is a conflict on the target name.
>>
>> > Eg. "mv /a/b/c /a" resulting in /a/b and /a/c
>> No. This means moving node /a/b/c to the root with name a. But /a exists
>> already.
>
> It certainly means what Felix said in a POXIX mv command. Are you using
> a different notation here?

The notation resembles that of the JSOP move operation (minus clutter) 
which it turn reflects the semantics of the JCR move methods: it 
includes the name of the target node. See also my definition above:

"Let p, q be paths. Then p:q denotes a move operation where the node at 
p is moved to a new node at q."

Michael

Re: [jr3 microkernel] Change log consolidation

Posted by Julian Reschke <ju...@gmx.de>.

On 2012-02-06 13:38, Michael Dürig wrote:
>
>
> On Monday, February 6, 2012, Felix Meschberger <fmeschbe@adobe.com
> <ma...@adobe.com>> wrote:
>  > Hi,
>  >
>  > Am 03.02.2012 um 19:44 schrieb Michael Dürig:
>  >
>  >> Invalid moves are exactly those which would result in a node being moved
>  >> to an ancestor/descendant of itself.
>  >
>  > I understand why a move to a descendant of self is invalid.
>  >
>  > But why is a move to an ancestor of self invalid ?
> Because there is a conflict on the target name.
>
>  > Eg. "mv /a/b/c /a" resulting in /a/b and /a/c
> No. This means moving node /a/b/c to the root with name a. But /a exists
> already.

It certainly means what Felix said in a POXIX mv command. Are you using 
a different notation here?

Re: [jr3 microkernel] Change log consolidation

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

Am 06.02.2012 um 13:38 schrieb Michael Dürig:



On Monday, February 6, 2012, Felix Meschberger <fm...@adobe.com>> wrote:
> Hi,
>
> Am 03.02.2012 um 19:44 schrieb Michael Dürig:
>
>> Invalid moves are exactly those which would result in a node being moved
>> to an ancestor/descendant of itself.
>
> I understand why a move to a descendant of self is invalid.
>
> But why is a move to an ancestor of self invalid ?
Because there is a conflict on the target name.

> Eg. "mv /a/b/c /a" resulting in /a/b and /a/c
No. This means moving node /a/b/c to the root with name a. But /a exists already.

Ok, now I understand. Thanks.

Regards
Felix


Michael

>
> Did miss(-understand) something ?
>
> Regards
> Felix

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <mi...@gmail.com>.

On Monday, February 6, 2012, Felix Meschberger <fm...@adobe.com> wrote:
> Hi,
>
> Am 03.02.2012 um 19:44 schrieb Michael Dürig:
>
>> Invalid moves are exactly those which would result in a node being moved
>> to an ancestor/descendant of itself.
>
> I understand why a move to a descendant of self is invalid.
>
> But why is a move to an ancestor of self invalid ?
Because there is a conflict on the target name.

> Eg. "mv /a/b/c /a" resulting in /a/b and /a/c
No. This means moving node /a/b/c to the root with name a. But /a exists
already.

Michael

>
> Did miss(-understand) something ?
>
> Regards
> Felix

Re: [jr3 microkernel] Change log consolidation

Posted by Felix Meschberger <fm...@adobe.com>.

Hi,

Am 03.02.2012 um 19:44 schrieb Michael Dürig:

> Invalid moves are exactly those which would result in a node being moved 
> to an ancestor/descendant of itself.

I understand why a move to a descendant of self is invalid.

But why is a move to an ancestor of self invalid ?

Eg. "mv /a/b/c /a" resulting in /a/b and /a/c 

Did miss(-understand) something ?

Regards
Felix

Re: [jr3 microkernel] Change log consolidation

Posted by Julian Reschke <ju...@gmx.de>.

On 2012-02-06 15:18, Stefan Guggisberg wrote:
> On Mon, Feb 6, 2012 at 3:10 PM, Julian Reschke<ju...@gmx.de>  wrote:
>> On 2012-02-06 15:02, Stefan Guggisberg wrote:
>>>
>>> On Mon, Feb 6, 2012 at 2:39 PM, Thomas Mueller<mu...@adobe.com>    wrote:
>>>>
>>>> Hi,
>>>>
>>>>> Instead of separate Added/Existing/Moved/Removed instances in the
>>>>> ChangeTree, did you consider keeping the modified content just as
>>>>> (say) TransientNode instances, without trying to keep track of how
>>>>> they came to exist?
>>>>> Then, when you actually need the change log, you
>>>>> should still be able to construct it by diffing the transient tree
>>>>> against the persistent base state of the content tree. The only caveat
>>>>> I know of is that moves can only be reliably detected for
>>>>> referenceable nodes or ones with an equivalent internal unique ID
>>>>> (which we shouldn't have trouble doing if needed).
>>>>
>>>>
>>>> So far we don't have any internal unique ID, and I would rather not add
>>>> one. I think reconstructing the changes by comparing the data is
>>>>
>>>> (a) unnecessarily complicated
>>>
>>>
>>> i don't agree, why would that be complicated?
>>> when using a content-addressable model diff'ing get's even
>>> extremely efficient.
>>>
>>>> (b) wouldn't work for some cases and therefore break observation
>>>
>>>
>>> i partially agree. however, the current jackrabbit implementation
>>> creates events based on diff'ing states and IMO it works pretty
>>> well in practice. i am not aware of real world uses cases affected.
>>
>>
>> <https://issues.apache.org/jira/browse/JCR-3207>
>
> agreed, but personally i don't consider this a real world use case.
>
> cheers
> stefan

Well, it's part of the spec. If we don't like what the spec says, we 
should argue the point in the Expert Group.

Best regards, Julian

Re: [jr3 microkernel] Change log consolidation

Posted by Stefan Guggisberg <st...@gmail.com>.

On Mon, Feb 6, 2012 at 3:10 PM, Julian Reschke <ju...@gmx.de> wrote:
> On 2012-02-06 15:02, Stefan Guggisberg wrote:
>>
>> On Mon, Feb 6, 2012 at 2:39 PM, Thomas Mueller<mu...@adobe.com>  wrote:
>>>
>>> Hi,
>>>
>>>> Instead of separate Added/Existing/Moved/Removed instances in the
>>>> ChangeTree, did you consider keeping the modified content just as
>>>> (say) TransientNode instances, without trying to keep track of how
>>>> they came to exist?
>>>> Then, when you actually need the change log, you
>>>> should still be able to construct it by diffing the transient tree
>>>> against the persistent base state of the content tree. The only caveat
>>>> I know of is that moves can only be reliably detected for
>>>> referenceable nodes or ones with an equivalent internal unique ID
>>>> (which we shouldn't have trouble doing if needed).
>>>
>>>
>>> So far we don't have any internal unique ID, and I would rather not add
>>> one. I think reconstructing the changes by comparing the data is
>>>
>>> (a) unnecessarily complicated
>>
>>
>> i don't agree, why would that be complicated?
>> when using a content-addressable model diff'ing get's even
>> extremely efficient.
>>
>>> (b) wouldn't work for some cases and therefore break observation
>>
>>
>> i partially agree. however, the current jackrabbit implementation
>> creates events based on diff'ing states and IMO it works pretty
>> well in practice. i am not aware of real world uses cases affected.
>
>
> <https://issues.apache.org/jira/browse/JCR-3207>

agreed, but personally i don't consider this a real world use case.

cheers
stefan

>
>> ...
>
>
> Best regards, Julian

Re: [jr3 microkernel] Change log consolidation

Posted by Julian Reschke <ju...@gmx.de>.

On 2012-02-06 15:02, Stefan Guggisberg wrote:
> On Mon, Feb 6, 2012 at 2:39 PM, Thomas Mueller<mu...@adobe.com>  wrote:
>> Hi,
>>
>>> Instead of separate Added/Existing/Moved/Removed instances in the
>>> ChangeTree, did you consider keeping the modified content just as
>>> (say) TransientNode instances, without trying to keep track of how
>>> they came to exist?
>>> Then, when you actually need the change log, you
>>> should still be able to construct it by diffing the transient tree
>>> against the persistent base state of the content tree. The only caveat
>>> I know of is that moves can only be reliably detected for
>>> referenceable nodes or ones with an equivalent internal unique ID
>>> (which we shouldn't have trouble doing if needed).
>>
>> So far we don't have any internal unique ID, and I would rather not add
>> one. I think reconstructing the changes by comparing the data is
>>
>> (a) unnecessarily complicated
>
> i don't agree, why would that be complicated?
> when using a content-addressable model diff'ing get's even
> extremely efficient.
>
>> (b) wouldn't work for some cases and therefore break observation
>
> i partially agree. however, the current jackrabbit implementation
> creates events based on diff'ing states and IMO it works pretty
> well in practice. i am not aware of real world uses cases affected.

<https://issues.apache.org/jira/browse/JCR-3207>

> ...

Best regards, Julian

Re: [jr3 microkernel] Change log consolidation

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>> (a) unnecessarily complicated
>
>i don't agree, why would that be complicated?
>when using a content-addressable model diff'ing get's even
>extremely efficient.

Well, "extremely efficient" is relative. "Diffing" is definitely more
complicated than, well, not diffing. As an example, I know of no database
that tries to re-construct the SQL statements from the diff of the stored
data.

>i partially agree. however, the current jackrabbit implementation
>creates events based on diff'ing states and IMO it works pretty
>well in practice. i am not aware of real world uses cases affected.

We know some cases that don't work (reorder) but I agree they are not that
important. But I believe the current solution is rather complicated.

>personally i don't see the practical value in representing multiple
>reorder
>operations within a single save rather than consolidating them.

My approach is: keep the events as called by the application. I think
that's rather simple. The other approach is: from the events change the
state and then try to re-construct the events from there. That sounds
rather complicated and unnecessary to me, because originally you *had* the
events. Why did you throw them away?

Regards,
Thomas

Re: [jr3 microkernel] Change log consolidation

Posted by Stefan Guggisberg <st...@gmail.com>.

On Mon, Feb 6, 2012 at 2:39 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>Instead of separate Added/Existing/Moved/Removed instances in the
>>ChangeTree, did you consider keeping the modified content just as
>>(say) TransientNode instances, without trying to keep track of how
>>they came to exist?
>>Then, when you actually need the change log, you
>>should still be able to construct it by diffing the transient tree
>>against the persistent base state of the content tree. The only caveat
>>I know of is that moves can only be reliably detected for
>>referenceable nodes or ones with an equivalent internal unique ID
>>(which we shouldn't have trouble doing if needed).
>
> So far we don't have any internal unique ID, and I would rather not add
> one. I think reconstructing the changes by comparing the data is
>
> (a) unnecessarily complicated

i don't agree, why would that be complicated?
when using a content-addressable model diff'ing get's even
extremely efficient.

> (b) wouldn't work for some cases and therefore break observation

i partially agree. however, the current jackrabbit implementation
creates events based on diff'ing states and IMO it works pretty
well in practice. i am not aware of real world uses cases affected.

>
> The JCR API does define operations (move, add, remove, reorder...), and
> for observation we need them. The most simple solution is to keep the
> events. Why throw the events away if you need them later.
>
> An example for (b) is multiple reorder operations within the same node.

personally i don't see the practical value in representing multiple reorder
operations within a single save rather than consolidating them.

cheers
stefan

>
> Regards,
> Thomas
>

Re: [jr3 microkernel] Change log consolidation

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Feb 6, 2012 at 2:48 PM, Stefan Guggisberg
<st...@gmail.com> wrote:
> BTW, that's how it the transient space is implemented in the current
> jackrabbit-core.

Exactly. I think that also answers Thomas' concern (b).

BR,

Jukka Zitting

Re: [jr3 microkernel] Change log consolidation

Posted by Stefan Guggisberg <st...@gmail.com>.

On Mon, Feb 6, 2012 at 2:27 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Mon, Feb 6, 2012 at 1:53 PM, Michael Dürig <md...@apache.org> wrote:
>>> An alternative that AFAICT achieves the same effect in perhaps an
>>> easier to understand manner would be to use the state of the content
>>> tree instead of changes to that state as the key concept. [...]
>>
>> That's more or less what I started with [1] and what I referred to as can of
>> worms in my original mail. Detecting and normalizing/consolidating moves
>> turned out to be too tricky there. You basically have the same cases I
>> identified (1. - 16.) but its much more difficult to tell them apart.
>
> Instead of separate Added/Existing/Moved/Removed instances in the
> ChangeTree, did you consider keeping the modified content just as
> (say) TransientNode instances, without trying to keep track of how
> they came to exist? Then, when you actually need the change log, you
> should still be able to construct it by diffing the transient tree
> against the persistent base state of the content tree. The only caveat
> I know of is that moves can only be reliably detected for
> referenceable nodes or ones with an equivalent internal unique ID
> (which we shouldn't have trouble doing if needed).

BTW, that's how it the transient space is implemented in the current
jackrabbit-core.

cheers
stefan

>
> BR,
>
> Jukka Zitting

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.


On 6.2.12 13:27, Jukka Zitting wrote:
> On Mon, Feb 6, 2012 at 1:53 PM, Michael Dürig<md...@apache.org>  wrote:
>>> An alternative that AFAICT achieves the same effect in perhaps an
>>> easier to understand manner would be to use the state of the content
>>> tree instead of changes to that state as the key concept. [...]
>>
>> That's more or less what I started with [1] and what I referred to as can of
>> worms in my original mail. Detecting and normalizing/consolidating moves
>> turned out to be too tricky there. You basically have the same cases I
>> identified (1. - 16.) but its much more difficult to tell them apart.
>
> Instead of separate Added/Existing/Moved/Removed instances in the
> ChangeTree, did you consider keeping the modified content just as
> (say) TransientNode instances, without trying to keep track of how
> they came to exist? Then, when you actually need the change log, you
> should still be able to construct it by diffing the transient tree
> against the persistent base state of the content tree. The only caveat
> I know of is that moves can only be reliably detected for
> referenceable nodes or ones with an equivalent internal unique ID
> (which we shouldn't have trouble doing if needed).

I considered this but didn't really follow that approach since I wanted 
to be as independent from the persisted base as possible.
It might well be a working approach and it might have its own advantages 
but I doubt it will be simpler than my approach. Even with that approach 
there are many cases to consider many of which are edge cases.

Michael

>
> BR,
>
> Jukka Zitting

Re: [jr3 microkernel] Change log consolidation

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>Instead of separate Added/Existing/Moved/Removed instances in the
>ChangeTree, did you consider keeping the modified content just as
>(say) TransientNode instances, without trying to keep track of how
>they came to exist?
>Then, when you actually need the change log, you
>should still be able to construct it by diffing the transient tree
>against the persistent base state of the content tree. The only caveat
>I know of is that moves can only be reliably detected for
>referenceable nodes or ones with an equivalent internal unique ID
>(which we shouldn't have trouble doing if needed).

So far we don't have any internal unique ID, and I would rather not add
one. I think reconstructing the changes by comparing the data is

(a) unnecessarily complicated
(b) wouldn't work for some cases and therefore break observation

The JCR API does define operations (move, add, remove, reorder...), and
for observation we need them. The most simple solution is to keep the
events. Why throw the events away if you need them later.

An example for (b) is multiple reorder operations within the same node.

Regards,
Thomas

Re: [jr3 microkernel] Change log consolidation

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Feb 6, 2012 at 1:53 PM, Michael Dürig <md...@apache.org> wrote:
>> An alternative that AFAICT achieves the same effect in perhaps an
>> easier to understand manner would be to use the state of the content
>> tree instead of changes to that state as the key concept. [...]
>
> That's more or less what I started with [1] and what I referred to as can of
> worms in my original mail. Detecting and normalizing/consolidating moves
> turned out to be too tricky there. You basically have the same cases I
> identified (1. - 16.) but its much more difficult to tell them apart.

Instead of separate Added/Existing/Moved/Removed instances in the
ChangeTree, did you consider keeping the modified content just as
(say) TransientNode instances, without trying to keep track of how
they came to exist? Then, when you actually need the change log, you
should still be able to construct it by diffing the transient tree
against the persistent base state of the content tree. The only caveat
I know of is that moves can only be reliably detected for
referenceable nodes or ones with an equivalent internal unique ID
(which we shouldn't have trouble doing if needed).

BR,

Jukka Zitting

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.

> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<md...@apache.org>  wrote:
>> While working on a new transient space implementation [1] I discovered that
>> recreating a list of changes from transient modifications to the hierarchy
>> is like opening a can of worms. Rethinking things it turned out, that it is
>> much simpler to directly consolidate the change log.
>
> Sounds useful, especially for cases where potentially conflicting
> changes need to be merged. The more we can simplify and/or normalize
> changes, the easier merging them is.

Yes that's another reason I was working on this. We have the same 
underlying problem with cluster sync, applying transient changes into 
persistence space and refreshing persistent changes into the transient 
space.

I wasn't thinking about merging based on (consolidated) change logs yet 
but this is a very valid point!

> An alternative that AFAICT achieves the same effect in perhaps an
> easier to understand manner would be to use the state of the content
> tree instead of changes to that state as the key concept. Instead of a
> commit(changeLog, baseState) method we could have a commit(newState,
> baseState) method for persisting changes. The backend could then do a
> diff between the two states, which should result in pretty much the
> same list of changes as the proposed change log consolidation
> mechanism. The only tricky part that I can think of is detecting moves
> of non-referenceable nodes, but even that should be doable with hidden
> UUIDs like what we already have in current Jackrabbit.

That's more or less what I started with [1] and what I referred to as 
can of worms in my original mail. Detecting and 
normalizing/consolidating moves turned out to be too tricky there. You 
basically have the same cases I identified (1. - 16.) but its much more 
difficult to tell them apart.

Apart from that, I think there is really a value in having a stateless 
algorithm. And it is really not too hard to understand ;-)

Michael

[1] http://markmail.org/message/qkkcvtmtapas2cx4

>
> The nice thing about such state-based handling of changes is that the
> transient space in any case needs to maintain a representation of the
> state of the modified tree.
>
> PS. You might already have discussed and dismissed this idea off-list.
> If yes, what was the deal-breaker?
>
> BR,
>
> Jukka Zitting

Re: [jr3 microkernel] Change log consolidation

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>Yes, every system doesn't solve "some problem". But what you saying is:
>"I would probably not consolidate the change log [...]. Unless it turns
>out to be a problem in practice [...]"

Yes. I don't see why "unconsolidated" change log should be a problem,
except that it's *possibly* a bit slower. And I believe it's rather
unlikely that it's a performance problem. Actually, I believe
consolidating the change log will actually eat more performance than not
doing that. That means consolidating the change log would be just code
that makes the system slower, and is a potential source of errors.

>Which reads like: lets build jr3, and fix it when it fails in deployment.

You forgot about testing :-)

Regards,
Thomas

Re: [jr3 microkernel] Change log consolidation

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

I think we should keep the option to consolidate the change log. I think
it's a smart idea. But I believe it's much too early to say whether it
will improve performance of real applications a lot or not. It is just
impossible to say at the moment, because we don't have enough real
applications and use cases. Also, we might be able to use it for
clustering.

So I would prefer if we have a setting to disable change log
consolidation, so we can test it. Similar to other options, for example
support for large child node lists: we currently believe it will solve
many problems, but we also know it is problematic in some cases (because
child node order is not preserved). So we have a setting, and we can play
with that.

Regards,
Thomas

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.


>> That means designing a potentially broken system ("Unless it turns out
>> to be a problem in practice") right from the start...
>
> No. By your definition every system is broken, because every system
> doesn't solve "some" problem.
>

Yes, every system doesn't solve "some problem". But what you saying is: 
"I would probably not consolidate the change log [...]. Unless it turns 
out to be a problem in practice [...]"

Which reads like: lets build jr3, and fix it when it fails in deployment.

Michael

Re: [jr3 microkernel] Change log consolidation

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>The problem here is that everyone on the list has his one favourite set
>of normal and weird cases.

That's true. For me, it's a weird use case if somebody moves a node twice
within the same *transaction* (from "a" to "temp" to "b"), and somebody
else *concurrently* removes the "temp". For me, it would be OK if the
transaction fails because "temp" was removed. I don't know what others
think about it.

>That means designing a potentially broken system ("Unless it turns out
>to be a problem in practice") right from the start...

No. By your definition every system is broken, because every system
doesn't solve "some" problem.

Every feature has a cost. I would try to avoid the cost of features nobody
actually needs. I would even argue good design is minimalistic (make it as
simple as possible but not simpler). But deciding which feature is
required isn't that simple of course.

Your solution certainly seems interesting, and I would keep it if
possible, with an option to disable it.

But I just don't know how useful it is in practice. If the use case
described above (concurrent delete of temp node) is really the only use
case then I think it's not worth it. I have the feeling there are other
use cases, that are more relevant.

Regards,
Thomas

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.


On 6.2.12 11:30, Thomas Mueller wrote:
> Hi,
>
>> An alternative that AFAICT achieves the same effect in perhaps an
>> easier to understand manner would be to use the state of the content
>> tree instead of changes to that state as the key concept. Instead of a
>> commit(changeLog, baseState) method we could have a commit(newState,
>> baseState) method for persisting changes.
>
> That's what I implemented for my first prototype (sandbox/jackrabbit-j3).
>
>> PS. You might already have discussed and dismissed this idea off-list.
>> If yes, what was the deal-breaker?
>
> I think the deal breaker is the API, which currently uses the 'diff' style
> instead of the 'absolute target state' style.

Yes that's certainly another issue.

> For me, it doesn't matter. Either way is fine (diff style or sending the
> full state). The diff style requires a bit less data to be transferred I
> guess.
>
> However, I personally wouldn't consolidate the change log currently. I
> don't think it's necessary, and for the normal case (which is _not_ a
> randomly generated change log, but a manually generated one) I don't think
> it will help a lot. Also, I don't currently see that would help a lot
> resolving conflicts. The conflicts it can resolve seem to be weird cases
> (my opinion, from what I know so far), which are not that important to be
> resolved. Why would somebody move a node twice? If he does, it's his
> problem (the applications problem). And why would we want to avoid a
> conflict if somebody deletes the intermediate node in the meantime? What I
> mean is, failure to merge such a case would be just fine. Maybe there are
> other, more important cases that I currently don't know about.

The problem here is that everyone on the list has his one favourite set 
of normal and weird cases...

> I would probably not consolidate the change log, because it simpler not
> to. Unless it turns out to be a problem in practice (not just in theory).
> Still, it is an interesting idea.

That means designing a potentially broken system ("Unless it turns out 
to be a problem in practice") right from the start...

>
> Regards,
> Thomas
>

Re: [jr3 microkernel] Change log consolidation

Posted by Michael Dürig <md...@apache.org>.

> The more interesting problem is the conflict handling against a committed
> changelist, or a concurrently applied changelist (thinking of a
> distributed cluster). Maybe the same approach using operations to merge
> concurrent changes (instead of operations vs. persisted state) can be used
> here. If you don't have to look at the state (i.e. possibly the entire
> repository), things might be faster or less memory intensive.
>
> This sounds similar to the Operational transformation (OT) approach [0]
> used for concurrent text editors. However, OT turns out to be quite
> difficult to formally prove even for insert/remove actions in a text, when
> concurrency comes into play. With hierarchy operations, this might even be
> more complex. But maybe worth a thought at least (and more Mathematicians
> like Michi :-) on the team).

Thanks for that link. I've seen this quite some time ago but forgot 
about it... In fact, my approach partially resembles this AFAIU it: 
change the order of operations and if necessary transform them to 
accommodate for the reversed order of execution. Since so far this is 
consolidation only (in contrast to merging concurrent changes) it is 
much simpler though. The gap in complexity can easily be seen in my 
rules 1. - 16.: many of them don't ever apply (the ones marked as 
illegal). That is, they can't ever occur in non concurrent change logs. 
When merging concurrent change logs however the situation is different. 
Then many if not all of these cases would need to be handled as well.

Michael


>
> [0] http://en.wikipedia.org/wiki/Operational_transformation
>
> Cheers,
> Alex
>
> On 06.02.12 12:30, "Thomas Mueller"<mu...@adobe.com>  wrote:
>
>
>> Hi,
>>
>>> An alternative that AFAICT achieves the same effect in perhaps an
>>> easier to understand manner would be to use the state of the content
>>> tree instead of changes to that state as the key concept. Instead of a
>>> commit(changeLog, baseState) method we could have a commit(newState,
>>> baseState) method for persisting changes.
>>
>> That's what I implemented for my first prototype (sandbox/jackrabbit-j3).
>>
>>> PS. You might already have discussed and dismissed this idea off-list.
>>> If yes, what was the deal-breaker?
>>
>> I think the deal breaker is the API, which currently uses the 'diff' style
>> instead of the 'absolute target state' style.
>>
>> For me, it doesn't matter. Either way is fine (diff style or sending the
>> full state). The diff style requires a bit less data to be transferred I
>> guess.
>>
>> However, I personally wouldn't consolidate the change log currently. I
>> don't think it's necessary, and for the normal case (which is _not_ a
>> randomly generated change log, but a manually generated one) I don't think
>> it will help a lot. Also, I don't currently see that would help a lot
>> resolving conflicts. The conflicts it can resolve seem to be weird cases
>> (my opinion, from what I know so far), which are not that important to be
>> resolved. Why would somebody move a node twice? If he does, it's his
>> problem (the applications problem). And why would we want to avoid a
>> conflict if somebody deletes the intermediate node in the meantime? What I
>> mean is, failure to merge such a case would be just fine. Maybe there are
>> other, more important cases that I currently don't know about.
>>
>> I would probably not consolidate the change log, because it simpler not
>> to. Unless it turns out to be a problem in practice (not just in theory).
>> Still, it is an interesting idea.
>>
>> Regards,
>> Thomas
>>
>>
>
>
>

Re: [jr3 microkernel] Change log consolidation

Posted by Alexander Klimetschek <ak...@adobe.com>.

>From looking at the use cases I agree with Thomas - in reality you
probably don't reduce much for an individual change list, since these are
typically adding one or few nodes + a bunch of properties, or removing a
few, but not necessarily mixing multiple move, add & remove operations in
a single session. For some edge cases (as the mentioned move + move back)
it is obviously useful and including it should not hurt, if the algorithm
is proven.

The more interesting problem is the conflict handling against a committed
changelist, or a concurrently applied changelist (thinking of a
distributed cluster). Maybe the same approach using operations to merge
concurrent changes (instead of operations vs. persisted state) can be used
here. If you don't have to look at the state (i.e. possibly the entire
repository), things might be faster or less memory intensive.

This sounds similar to the Operational transformation (OT) approach [0]
used for concurrent text editors. However, OT turns out to be quite
difficult to formally prove even for insert/remove actions in a text, when
concurrency comes into play. With hierarchy operations, this might even be
more complex. But maybe worth a thought at least (and more Mathematicians
like Michi :-) on the team).

[0] http://en.wikipedia.org/wiki/Operational_transformation

Cheers,
Alex

On 06.02.12 12:30, "Thomas Mueller" <mu...@adobe.com> wrote:


>Hi,
>
>>An alternative that AFAICT achieves the same effect in perhaps an
>>easier to understand manner would be to use the state of the content
>>tree instead of changes to that state as the key concept. Instead of a
>>commit(changeLog, baseState) method we could have a commit(newState,
>>baseState) method for persisting changes.
>
>That's what I implemented for my first prototype (sandbox/jackrabbit-j3).
>
>>PS. You might already have discussed and dismissed this idea off-list.
>>If yes, what was the deal-breaker?
>
>I think the deal breaker is the API, which currently uses the 'diff' style
>instead of the 'absolute target state' style.
>
>For me, it doesn't matter. Either way is fine (diff style or sending the
>full state). The diff style requires a bit less data to be transferred I
>guess.
>
>However, I personally wouldn't consolidate the change log currently. I
>don't think it's necessary, and for the normal case (which is _not_ a
>randomly generated change log, but a manually generated one) I don't think
>it will help a lot. Also, I don't currently see that would help a lot
>resolving conflicts. The conflicts it can resolve seem to be weird cases
>(my opinion, from what I know so far), which are not that important to be
>resolved. Why would somebody move a node twice? If he does, it's his
>problem (the applications problem). And why would we want to avoid a
>conflict if somebody deletes the intermediate node in the meantime? What I
>mean is, failure to merge such a case would be just fine. Maybe there are
>other, more important cases that I currently don't know about.
>
>I would probably not consolidate the change log, because it simpler not
>to. Unless it turns out to be a problem in practice (not just in theory).
>Still, it is an interesting idea.
>
>Regards,
>Thomas
>
>



-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel

Re: [jr3 microkernel] Change log consolidation

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>An alternative that AFAICT achieves the same effect in perhaps an
>easier to understand manner would be to use the state of the content
>tree instead of changes to that state as the key concept. Instead of a
>commit(changeLog, baseState) method we could have a commit(newState,
>baseState) method for persisting changes.

That's what I implemented for my first prototype (sandbox/jackrabbit-j3).

>PS. You might already have discussed and dismissed this idea off-list.
>If yes, what was the deal-breaker?

I think the deal breaker is the API, which currently uses the 'diff' style
instead of the 'absolute target state' style.

For me, it doesn't matter. Either way is fine (diff style or sending the
full state). The diff style requires a bit less data to be transferred I
guess.

However, I personally wouldn't consolidate the change log currently. I
don't think it's necessary, and for the normal case (which is _not_ a
randomly generated change log, but a manually generated one) I don't think
it will help a lot. Also, I don't currently see that would help a lot
resolving conflicts. The conflicts it can resolve seem to be weird cases
(my opinion, from what I know so far), which are not that important to be
resolved. Why would somebody move a node twice? If he does, it's his
problem (the applications problem). And why would we want to avoid a
conflict if somebody deletes the intermediate node in the meantime? What I
mean is, failure to merge such a case would be just fine. Maybe there are
other, more important cases that I currently don't know about.

I would probably not consolidate the change log, because it simpler not
to. Unless it turns out to be a problem in practice (not just in theory).
Still, it is an interesting idea.

Regards,
Thomas

Re: [jr3 microkernel] Change log consolidation

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig <md...@apache.org> wrote:
> While working on a new transient space implementation [1] I discovered that
> recreating a list of changes from transient modifications to the hierarchy
> is like opening a can of worms. Rethinking things it turned out, that it is
> much simpler to directly consolidate the change log.

Sounds useful, especially for cases where potentially conflicting
changes need to be merged. The more we can simplify and/or normalize
changes, the easier merging them is.

An alternative that AFAICT achieves the same effect in perhaps an
easier to understand manner would be to use the state of the content
tree instead of changes to that state as the key concept. Instead of a
commit(changeLog, baseState) method we could have a commit(newState,
baseState) method for persisting changes. The backend could then do a
diff between the two states, which should result in pretty much the
same list of changes as the proposed change log consolidation
mechanism. The only tricky part that I can think of is detecting moves
of non-referenceable nodes, but even that should be doable with hidden
UUIDs like what we already have in current Jackrabbit.

The nice thing about such state-based handling of changes is that the
transient space in any case needs to maintain a representation of the
state of the modified tree.

PS. You might already have discussed and dismissed this idea off-list.
If yes, what was the deal-breaker?

BR,

Jukka Zitting