You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/11/27 11:56:23 UTC

Handling copies and moves with tree diffs

Hi,

As discussed at length earlier, in oak-core we only keep track of tree
states instead of change logs, i.e. the sequences of changes that lead
from one state to another. This works pretty well in general and
avoids a lot of extra complexity, but poses the question of how to
best produce the JSOP change log that the MicroKernel expects.

The default case where content is simply added, modified or removed is
easily handled by the existing JsopDiff class that maps a tree diff to
the corresponding JSOP operations. The only big problem with this
approach is that it doesn't handle bulk operations like copies and
moves too well, instead it falls back to large add(+remove)
operations. Currently we work around this problem by mapping each
higher-level copy and move operation to separate branch commits that
contain just that operation, which isn't too ideal.

Michael already did quite a bit of work on this problem earlier and
realized that it's essentially impossible to reliably extract all
copies and moves from just a tree diff. He did come up with a pretty
brilliant mechanism for handling that case when we do have the
original change log, but I'd really like to avoid having to keep track
of all changes.

Thus I looked at how we could leverage the following two
simplifications to the problem posed by a general tree diff:

1) For each tree node we keep track of its original location. For
example in KernelNodeState we have a record of the path at which the
node existed in the base revision.

2) We don't need optimal handling of all possible corner cases. As
long as we cover basic copy and move use cases we can let the handling
of trickier situations (like replacing a tree with one of its
subtrees, or swapping two subtrees) degrade to the fallback mechanism
we already have in JsopDiff. The lack of fully accurate move tracking
may cause difficulty in merging concurrent changes over moves, but I
think a few extra potential conflicts over non-trivial moves is a
reasonable price to pay.

I did some work along these lines already earlier with the
CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
but couldn't make it work properly. Based on discussions with Michael
last week I think we should be able to come up with an algorithm that
works pretty well for this purpose. More on that in follow-ups.

BR,

Jukka Zitting

Re: Handling copies and moves with tree diffs

Posted by Michael Dürig <md...@apache.org>.
Hi,

On 27.11.12 10:56, Jukka Zitting wrote:
> Hi,
>
> As discussed at length earlier, in oak-core we only keep track of tree
> states instead of change logs, i.e. the sequences of changes that lead
> from one state to another. This works pretty well in general and
> avoids a lot of extra complexity, but poses the question of how to
> best produce the JSOP change log that the MicroKernel expects.
>
> The default case where content is simply added, modified or removed is
> easily handled by the existing JsopDiff class that maps a tree diff to
> the corresponding JSOP operations. The only big problem with this
> approach is that it doesn't handle bulk operations like copies and
> moves too well, instead it falls back to large add(+remove)
> operations. Currently we work around this problem by mapping each
> higher-level copy and move operation to separate branch commits that
> contain just that operation, which isn't too ideal.
>
> Michael already did quite a bit of work on this problem earlier and
> realized that it's essentially impossible to reliably extract all
> copies and moves from just a tree diff. He did come up with a pretty
> brilliant mechanism for handling that case when we do have the
> original change log, but I'd really like to avoid having to keep track
> of all changes.

Thanks Jukka for picking up the ball! Our private chat from last week 
triggered some new ideas on my side. Basically I

* reviewed the preconditions: what information do we have or could we 
make available? How does this differ from the preconditions I based my 
original work on change log on [1, 2]?

* Applied the knowledge gathered while working on my change log based 
solution upon the new situation.

> Thus I looked at how we could leverage the following two
> simplifications to the problem posed by a general tree diff:
>
> 1) For each tree node we keep track of its original location. For
> example in KernelNodeState we have a record of the path at which the
> node existed in the base revision.

Right, this matches my observation. However, I think recording the path 
is not sufficient. We need to record the parent node.

>
> 2) We don't need optimal handling of all possible corner cases. As
> long as we cover basic copy and move use cases we can let the handling
> of trickier situations (like replacing a tree with one of its
> subtrees, or swapping two subtrees) degrade to the fallback mechanism
> we already have in JsopDiff. The lack of fully accurate move tracking
> may cause difficulty in merging concurrent changes over moves, but I
> think a few extra potential conflicts over non-trivial moves is a
> reasonable price to pay.

If we record the original parent node for all moved/copied nodes, I 
think it is possible to come up with a change log which is not larger 
than twice the size of a minimal change log. The tricky part here is to 
find out in what order to execute copy and move operations.

In general the size of the change log should be acceptable and for most 
common cases I think it will be even minimal. If we really think we need 
to squeeze the change logs down to minimal, there is sill my initial 
change log consolidator [1, 2].

>
> I did some work along these lines already earlier with the
> CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
> but couldn't make it work properly. Based on discussions with Michael
> last week I think we should be able to come up with an algorithm that
> works pretty well for this purpose. More on that in follow-ups.

I'll also sketch my idea an follow up with a separate message.

Michael


[1] http://markmail.org/message/3c22jqy6gzzkyxx2
[2] 
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>
> BR,
>
> Jukka Zitting
>

Re: Handling copies and moves with tree diffs

Posted by Michael Dürig <md...@apache.org>.
Hi,

I just committed a proof of concept implementation of below algorithm to 
my GitHub repository at https://github.com/mduerig/json-diff. For 
maximal clarity this is a standalone implementation outside the Oak 
framework.

Michael

On 27.11.12 13:16, Michael Dürig wrote:
>
> Hi,
>
> As mentioned in my previous message, here is a draft of an algorithm
> which should be capable of creating a change log from two trees, such
> that when the change log is applied to the source tree it will transform
> it to the destination tree. The algorithm is capable of coping with copy
> and move operations given that the source node of such operations is
> tracked.
>
> The algorithm relies on a couple of observations:
>
> * Deletions of children needs to be handled before addition in order to
> avoid conflicting child names.
>
> * Actual deletion needs to be deferred, since child nodes of a deleted
> node might still be be referred by a move operation which is only
> detected later in the differencing process. The algorithm does this by
> moving such nodes to a "trash" node which is removed again at the very
> end. This additional move operation might in the worst case cause each
> moved node to be moved twice leading to a (IMO negligible) blow up of
> the change log.
>
> * Reconstructing a change log relies on incrementally applying detected
> changes to the source tree until it matches the destination tree. This
> is the place where tracking of source nodes of move and copy operations
> is crucial.
>
>
> The pseudo code of the algorithm abstracts away from the current data
> structures we use in Oak and is only concerned with nodes. Its main
> purpose is to demonstrate the core of the algorithm so there might be
> still lots of room for improvements and optimisations. Also the tree
> structures used in the algorithm are mutable. However it should
> eventually easily to adapt these to NodeBuilders.
>
> Michael
>
> // Global variable holding the JSOP journal after the diffTree below
> returns.
> jsop = ""
>
> /*
>    Create a JSOP journal, which when applied to tree S will transform
>    it to tree T.
> */
> diffTrees(S, T) {
>    // Create a location (trash) which will temporarily hold removed nodes.
>    // This is necessary since these (or child nodes thereof) might still be
>    // needed in move operations occurring only later in the differencing
> process.
>    X = S.addNode(createUniqueName)
>
>    // The main differencing process starts at the roots of the trees and
>    // progresses recursively level by level.
>    diffNodes(X, S, T)
>
>    // Remove the trash location and all its content
>    jsop += "-" + X.path
> }
>
> /*
>    Create JSOP operations for the differences of the immediate children
>    of trees S and T. Tree X serves as trash.
> */
> diffNode(X, S, T) {
>    deleted = S.childNames \ T.childNames   // set difference
>    added   = T.childNames \ S.childNames
>
>    // Need to take care of deleted nodes first in order to avoid
>    // name clashes when adding new nodes later.
>    for (d : deleted) {
>      t = S.child(d)
>      n = createUniqueName
>
>      // Deleted nodes are moved to trash.
>      t.moveTo(X, n) // Move node t to parent X with name n
>      op = ">" + t.sourceNode.path + ":" + t.path
>      jsop += op
>      S.apply(op)               // Transform S according to the single op
>    }
>
>    for (a : added) {
>      t = T.child(a)
>
>      // assume we can detect a copied node and know its source node
>      if (isCopied(t)) {
>        op = "*" + t.sourceNode.path + ":" + t.path
>      }
>
>      // assume we can detect a moved node and know its source node
>      else if (isMoved(t)) {
>        op = ">" + t.sourceNode.path + ":" + t.path
>      }
>
>      // this is an added node
>      else {
>        op = "+" + t.path
>      }
>
>      jsop += op
>      S.apply(op)               // Transform S according to the single op
>    }
>
>    // assert S.childNames == T.childNames
>    for (c : T.childNames) {
>      diffNode(S.child(c), T.child(c))
>    }
> }
>
>
>
>
> On 27.11.12 10:56, Jukka Zitting wrote:
>> Hi,
>>
>> As discussed at length earlier, in oak-core we only keep track of tree
>> states instead of change logs, i.e. the sequences of changes that lead
>> from one state to another. This works pretty well in general and
>> avoids a lot of extra complexity, but poses the question of how to
>> best produce the JSOP change log that the MicroKernel expects.
>>
>> The default case where content is simply added, modified or removed is
>> easily handled by the existing JsopDiff class that maps a tree diff to
>> the corresponding JSOP operations. The only big problem with this
>> approach is that it doesn't handle bulk operations like copies and
>> moves too well, instead it falls back to large add(+remove)
>> operations. Currently we work around this problem by mapping each
>> higher-level copy and move operation to separate branch commits that
>> contain just that operation, which isn't too ideal.
>>
>> Michael already did quite a bit of work on this problem earlier and
>> realized that it's essentially impossible to reliably extract all
>> copies and moves from just a tree diff. He did come up with a pretty
>> brilliant mechanism for handling that case when we do have the
>> original change log, but I'd really like to avoid having to keep track
>> of all changes.
>>
>> Thus I looked at how we could leverage the following two
>> simplifications to the problem posed by a general tree diff:
>>
>> 1) For each tree node we keep track of its original location. For
>> example in KernelNodeState we have a record of the path at which the
>> node existed in the base revision.
>>
>> 2) We don't need optimal handling of all possible corner cases. As
>> long as we cover basic copy and move use cases we can let the handling
>> of trickier situations (like replacing a tree with one of its
>> subtrees, or swapping two subtrees) degrade to the fallback mechanism
>> we already have in JsopDiff. The lack of fully accurate move tracking
>> may cause difficulty in merging concurrent changes over moves, but I
>> think a few extra potential conflicts over non-trivial moves is a
>> reasonable price to pay.
>>
>> I did some work along these lines already earlier with the
>> CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
>> but couldn't make it work properly. Based on discussions with Michael
>> last week I think we should be able to come up with an algorithm that
>> works pretty well for this purpose. More on that in follow-ups.
>>
>> BR,
>>
>> Jukka Zitting
>>

Re: Handling copies and moves with tree diffs

Posted by Michael Dürig <mi...@gmail.com>.
Hi,

to track this more easily I committed the pseudo code to svn: 
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/doc/jsop-diff.md?view=markup

A small correction to my initial post is here: 
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/doc/jsop-diff.md?r1=1414189&r2=1414190&pathrev=1414190&view=diff&diff_format=f

Michael


On 27.11.12 13:16, Michael Dürig wrote:
>
> Hi,
>
> As mentioned in my previous message, here is a draft of an algorithm
> which should be capable of creating a change log from two trees, such
> that when the change log is applied to the source tree it will transform
> it to the destination tree. The algorithm is capable of coping with copy
> and move operations given that the source node of such operations is
> tracked.
>
> The algorithm relies on a couple of observations:
>
> * Deletions of children needs to be handled before addition in order to
> avoid conflicting child names.
>
> * Actual deletion needs to be deferred, since child nodes of a deleted
> node might still be be referred by a move operation which is only
> detected later in the differencing process. The algorithm does this by
> moving such nodes to a "trash" node which is removed again at the very
> end. This additional move operation might in the worst case cause each
> moved node to be moved twice leading to a (IMO negligible) blow up of
> the change log.
>
> * Reconstructing a change log relies on incrementally applying detected
> changes to the source tree until it matches the destination tree. This
> is the place where tracking of source nodes of move and copy operations
> is crucial.
>
>
> The pseudo code of the algorithm abstracts away from the current data
> structures we use in Oak and is only concerned with nodes. Its main
> purpose is to demonstrate the core of the algorithm so there might be
> still lots of room for improvements and optimisations. Also the tree
> structures used in the algorithm are mutable. However it should
> eventually easily to adapt these to NodeBuilders.
>
> Michael
>
> // Global variable holding the JSOP journal after the diffTree below
> returns.
> jsop = ""
>
> /*
>    Create a JSOP journal, which when applied to tree S will transform
>    it to tree T.
> */
> diffTrees(S, T) {
>    // Create a location (trash) which will temporarily hold removed nodes.
>    // This is necessary since these (or child nodes thereof) might still be
>    // needed in move operations occurring only later in the differencing
> process.
>    X = S.addNode(createUniqueName)
>
>    // The main differencing process starts at the roots of the trees and
>    // progresses recursively level by level.
>    diffNodes(X, S, T)
>
>    // Remove the trash location and all its content
>    jsop += "-" + X.path
> }
>
> /*
>    Create JSOP operations for the differences of the immediate children
>    of trees S and T. Tree X serves as trash.
> */
> diffNode(X, S, T) {
>    deleted = S.childNames \ T.childNames   // set difference
>    added   = T.childNames \ S.childNames
>
>    // Need to take care of deleted nodes first in order to avoid
>    // name clashes when adding new nodes later.
>    for (d : deleted) {
>      t = S.child(d)
>      n = createUniqueName
>
>      // Deleted nodes are moved to trash.
>      t.moveTo(X, n) // Move node t to parent X with name n
>      op = ">" + t.sourceNode.path + ":" + t.path
>      jsop += op
>      S.apply(op)               // Transform S according to the single op
>    }
>
>    for (a : added) {
>      t = T.child(a)
>
>      // assume we can detect a copied node and know its source node
>      if (isCopied(t)) {
>        op = "*" + t.sourceNode.path + ":" + t.path
>      }
>
>      // assume we can detect a moved node and know its source node
>      else if (isMoved(t)) {
>        op = ">" + t.sourceNode.path + ":" + t.path
>      }
>
>      // this is an added node
>      else {
>        op = "+" + t.path
>      }
>
>      jsop += op
>      S.apply(op)               // Transform S according to the single op
>    }
>
>    // assert S.childNames == T.childNames
>    for (c : T.childNames) {
>      diffNode(S.child(c), T.child(c))
>    }
> }
>
>
>
>
> On 27.11.12 10:56, Jukka Zitting wrote:
>> Hi,
>>
>> As discussed at length earlier, in oak-core we only keep track of tree
>> states instead of change logs, i.e. the sequences of changes that lead
>> from one state to another. This works pretty well in general and
>> avoids a lot of extra complexity, but poses the question of how to
>> best produce the JSOP change log that the MicroKernel expects.
>>
>> The default case where content is simply added, modified or removed is
>> easily handled by the existing JsopDiff class that maps a tree diff to
>> the corresponding JSOP operations. The only big problem with this
>> approach is that it doesn't handle bulk operations like copies and
>> moves too well, instead it falls back to large add(+remove)
>> operations. Currently we work around this problem by mapping each
>> higher-level copy and move operation to separate branch commits that
>> contain just that operation, which isn't too ideal.
>>
>> Michael already did quite a bit of work on this problem earlier and
>> realized that it's essentially impossible to reliably extract all
>> copies and moves from just a tree diff. He did come up with a pretty
>> brilliant mechanism for handling that case when we do have the
>> original change log, but I'd really like to avoid having to keep track
>> of all changes.
>>
>> Thus I looked at how we could leverage the following two
>> simplifications to the problem posed by a general tree diff:
>>
>> 1) For each tree node we keep track of its original location. For
>> example in KernelNodeState we have a record of the path at which the
>> node existed in the base revision.
>>
>> 2) We don't need optimal handling of all possible corner cases. As
>> long as we cover basic copy and move use cases we can let the handling
>> of trickier situations (like replacing a tree with one of its
>> subtrees, or swapping two subtrees) degrade to the fallback mechanism
>> we already have in JsopDiff. The lack of fully accurate move tracking
>> may cause difficulty in merging concurrent changes over moves, but I
>> think a few extra potential conflicts over non-trivial moves is a
>> reasonable price to pay.
>>
>> I did some work along these lines already earlier with the
>> CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
>> but couldn't make it work properly. Based on discussions with Michael
>> last week I think we should be able to come up with an algorithm that
>> works pretty well for this purpose. More on that in follow-ups.
>>
>> BR,
>>
>> Jukka Zitting
>>

Re: Handling copies and moves with tree diffs

Posted by Michael Dürig <md...@apache.org>.
Hi,

As mentioned in my previous message, here is a draft of an algorithm 
which should be capable of creating a change log from two trees, such 
that when the change log is applied to the source tree it will transform 
it to the destination tree. The algorithm is capable of coping with copy 
and move operations given that the source node of such operations is 
tracked.

The algorithm relies on a couple of observations:

* Deletions of children needs to be handled before addition in order to 
avoid conflicting child names.

* Actual deletion needs to be deferred, since child nodes of a deleted 
node might still be be referred by a move operation which is only 
detected later in the differencing process. The algorithm does this by 
moving such nodes to a "trash" node which is removed again at the very 
end. This additional move operation might in the worst case cause each 
moved node to be moved twice leading to a (IMO negligible) blow up of 
the change log.

* Reconstructing a change log relies on incrementally applying detected 
changes to the source tree until it matches the destination tree. This 
is the place where tracking of source nodes of move and copy operations 
is crucial.


The pseudo code of the algorithm abstracts away from the current data 
structures we use in Oak and is only concerned with nodes. Its main 
purpose is to demonstrate the core of the algorithm so there might be 
still lots of room for improvements and optimisations. Also the tree 
structures used in the algorithm are mutable. However it should 
eventually easily to adapt these to NodeBuilders.

Michael

// Global variable holding the JSOP journal after the diffTree below 
returns.
jsop = ""

/*
   Create a JSOP journal, which when applied to tree S will transform
   it to tree T.
*/
diffTrees(S, T) {
   // Create a location (trash) which will temporarily hold removed nodes.
   // This is necessary since these (or child nodes thereof) might still be
   // needed in move operations occurring only later in the differencing 
process.
   X = S.addNode(createUniqueName)

   // The main differencing process starts at the roots of the trees and
   // progresses recursively level by level.
   diffNodes(X, S, T)

   // Remove the trash location and all its content
   jsop += "-" + X.path
}

/*
   Create JSOP operations for the differences of the immediate children
   of trees S and T. Tree X serves as trash.
*/
diffNode(X, S, T) {
   deleted = S.childNames \ T.childNames   // set difference
   added   = T.childNames \ S.childNames

   // Need to take care of deleted nodes first in order to avoid
   // name clashes when adding new nodes later.
   for (d : deleted) {
     t = S.child(d)
     n = createUniqueName

     // Deleted nodes are moved to trash.
     t.moveTo(X, n) // Move node t to parent X with name n
     op = ">" + t.sourceNode.path + ":" + t.path
     jsop += op
     S.apply(op)               // Transform S according to the single op
   }

   for (a : added) {
     t = T.child(a)

     // assume we can detect a copied node and know its source node
     if (isCopied(t)) {
       op = "*" + t.sourceNode.path + ":" + t.path
     }

     // assume we can detect a moved node and know its source node
     else if (isMoved(t)) {
       op = ">" + t.sourceNode.path + ":" + t.path
     }

     // this is an added node
     else {
       op = "+" + t.path
     }

     jsop += op
     S.apply(op)               // Transform S according to the single op
   }

   // assert S.childNames == T.childNames
   for (c : T.childNames) {
     diffNode(S.child(c), T.child(c))
   }
}




On 27.11.12 10:56, Jukka Zitting wrote:
> Hi,
>
> As discussed at length earlier, in oak-core we only keep track of tree
> states instead of change logs, i.e. the sequences of changes that lead
> from one state to another. This works pretty well in general and
> avoids a lot of extra complexity, but poses the question of how to
> best produce the JSOP change log that the MicroKernel expects.
>
> The default case where content is simply added, modified or removed is
> easily handled by the existing JsopDiff class that maps a tree diff to
> the corresponding JSOP operations. The only big problem with this
> approach is that it doesn't handle bulk operations like copies and
> moves too well, instead it falls back to large add(+remove)
> operations. Currently we work around this problem by mapping each
> higher-level copy and move operation to separate branch commits that
> contain just that operation, which isn't too ideal.
>
> Michael already did quite a bit of work on this problem earlier and
> realized that it's essentially impossible to reliably extract all
> copies and moves from just a tree diff. He did come up with a pretty
> brilliant mechanism for handling that case when we do have the
> original change log, but I'd really like to avoid having to keep track
> of all changes.
>
> Thus I looked at how we could leverage the following two
> simplifications to the problem posed by a general tree diff:
>
> 1) For each tree node we keep track of its original location. For
> example in KernelNodeState we have a record of the path at which the
> node existed in the base revision.
>
> 2) We don't need optimal handling of all possible corner cases. As
> long as we cover basic copy and move use cases we can let the handling
> of trickier situations (like replacing a tree with one of its
> subtrees, or swapping two subtrees) degrade to the fallback mechanism
> we already have in JsopDiff. The lack of fully accurate move tracking
> may cause difficulty in merging concurrent changes over moves, but I
> think a few extra potential conflicts over non-trivial moves is a
> reasonable price to pay.
>
> I did some work along these lines already earlier with the
> CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
> but couldn't make it work properly. Based on discussions with Michael
> last week I think we should be able to come up with an algorithm that
> works pretty well for this purpose. More on that in follow-ups.
>
> BR,
>
> Jukka Zitting
>

Re: Handling copies and moves with tree diffs

Posted by Michael Dürig <md...@apache.org>.
Hi Jukka,

Thanks for sharing this. In a nutshell, what you are proposing to do is

* to keep no more than one move/copy operation pending and purge changes 
to the Microkernel as soon as another such operation arrives,

* in the generalised version, to only keep "unconflicting" move/copy 
operations pending and purge as soon as a conflicting one arrives.

Unconflicting means, that given any two move operations none of the four 
involved paths overlap. Where two paths overlap if one is an ancestor of 
the other.


This algorithm - though less general than the one I proposed - seems 
like a good choice for the time being. It is much clearer how to 
implement it in the existing code base. If it turns out that we need a 
more rigorous move/copy handling, we have still the other approach as a 
backup.


However, both our approached share a common problem: they don't work 
across a RootImpl.rebase since after rebasing the tracking information, 
which was initially available through the node builders, is lost. 
Pushing the rebase logic down to the Microkernel would solve this 
immediate problem but would lose us the ConflictHandler, which is 
currently used to annotate conflicting commits and which allows users to 
edit these before retrying the commit.

Michael


On 27.11.12 14:43, Jukka Zitting wrote:
> Hi,
>
> On Tue, Nov 27, 2012 at 12:56 PM, Jukka Zitting <ju...@gmail.com> wrote:
>> I did some work along these lines already earlier with the
>> CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
>> but couldn't make it work properly. Based on discussions with Michael
>> last week I think we should be able to come up with an algorithm that
>> works pretty well for this purpose. More on that in follow-ups.
>
> Here's one simple algorithm that should be able to correctly handle up
> to a single copy or move per commit. First the handling of changes in
> KernelNodeBuilder:
>
> A1) In KernelRootBuilder we keep track of the path of the *target* of
> a copy or move operation. Until such an operation occurs, that path
> remains null.
>
> A2) In KernelNodeBuilder.setNode() we check if the new child NodeState
> is based on the same MK revision as the parent but has a different
> path than the identified new child (and is not one of its
> descendants).
>
> A21) If not, we use normal tree diff to find out which basic
> add/set/remove operations to update the builder subtree to match that
> NodeState.
>
> A22) If yes, we update the target path in KernelRootBuilder to point
> to the new child node and use the given NodeState as the base of that
> builder subtree.
>
> If A22 is  reached when the target path in KernelRootBuilder is
> already set, all changes are automatically purged to a branch and the
> KernelRootBuilder reset to the new purged state before continuing the
> A22 step. Changes are also purged whenever a predefined purge limit is
> reached or commit is requested from a higher level. The JSOP diff for
> that purge commit is constructed as follows:
>
> B1) If the target path in KernelRootBuilder is set (and the
> copied/moved node at that path still exists), we first construct the
> relevant copy or move operation:
>
> B11) If the target bath already existed in the previous revision,
> we're dealing with a replacement. Emit a JSOP delete instruction for
> that path.
>
> B12) If the original path of that subtree still exists, we're dealing
> with a copy operation. Emit  the JSOP copy instruction.
>
> B13) If the original path no longer exists, we're dealing with a move
> operation. Emit the JSOP move operation and keep track of the original
> removed path.
>
> B2) Process the rest of the changes using the normal JsopDiff
> mechanism with the following modifications:
>
> B21) When encountering the target of the copy/move operation, instead
> of processing it normally as an added or modified subtree do a
> separate JsopDiff against the base state from the original source path
> to capture any post-copy/move changes to that subtree.
>
> B22) When encountering the removed path from B13, suppress the normal
> JSOP delete instruction that would have been generated by the standard
> JsopDiff class.
>
> It should be straightforward to generalize this algorithm to cover any
> number of *unconflicting* copies/moves in a single commit (conflicting
> changes would still need to be handled as a sequence of purged branch
> commits). Most notably this approach should be able to accurately map
> all JCR-level direct-to-workspace copy/move operations (including more
> complex things like checkin) to single MK commits.
>
> BR,
>
> Jukka Zitting
>

Re: Handling copies and moves with tree diffs

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Nov 27, 2012 at 12:56 PM, Jukka Zitting <ju...@gmail.com> wrote:
> I did some work along these lines already earlier with the
> CopyAndMoveAwareJsopDiff class you can find inside KernelRootBuilder,
> but couldn't make it work properly. Based on discussions with Michael
> last week I think we should be able to come up with an algorithm that
> works pretty well for this purpose. More on that in follow-ups.

Here's one simple algorithm that should be able to correctly handle up
to a single copy or move per commit. First the handling of changes in
KernelNodeBuilder:

A1) In KernelRootBuilder we keep track of the path of the *target* of
a copy or move operation. Until such an operation occurs, that path
remains null.

A2) In KernelNodeBuilder.setNode() we check if the new child NodeState
is based on the same MK revision as the parent but has a different
path than the identified new child (and is not one of its
descendants).

A21) If not, we use normal tree diff to find out which basic
add/set/remove operations to update the builder subtree to match that
NodeState.

A22) If yes, we update the target path in KernelRootBuilder to point
to the new child node and use the given NodeState as the base of that
builder subtree.

If A22 is  reached when the target path in KernelRootBuilder is
already set, all changes are automatically purged to a branch and the
KernelRootBuilder reset to the new purged state before continuing the
A22 step. Changes are also purged whenever a predefined purge limit is
reached or commit is requested from a higher level. The JSOP diff for
that purge commit is constructed as follows:

B1) If the target path in KernelRootBuilder is set (and the
copied/moved node at that path still exists), we first construct the
relevant copy or move operation:

B11) If the target bath already existed in the previous revision,
we're dealing with a replacement. Emit a JSOP delete instruction for
that path.

B12) If the original path of that subtree still exists, we're dealing
with a copy operation. Emit  the JSOP copy instruction.

B13) If the original path no longer exists, we're dealing with a move
operation. Emit the JSOP move operation and keep track of the original
removed path.

B2) Process the rest of the changes using the normal JsopDiff
mechanism with the following modifications:

B21) When encountering the target of the copy/move operation, instead
of processing it normally as an added or modified subtree do a
separate JsopDiff against the base state from the original source path
to capture any post-copy/move changes to that subtree.

B22) When encountering the removed path from B13, suppress the normal
JSOP delete instruction that would have been generated by the standard
JsopDiff class.

It should be straightforward to generalize this algorithm to cover any
number of *unconflicting* copies/moves in a single commit (conflicting
changes would still need to be handled as a sequence of purged branch
commits). Most notably this approach should be able to accurately map
all JCR-level direct-to-workspace copy/move operations (including more
complex things like checkin) to single MK commits.

BR,

Jukka Zitting

Re: Handling copies and moves with tree diffs

Posted by Michael Dürig <md...@apache.org>.

On 27.11.12 12:39, Thomas Mueller wrote:
>> Currently we work around this problem by mapping each
>> >higher-level copy and move operation to separate branch commits that
>> >contain just that operation, which isn't too ideal.
> Could you tell me what problems you see with this workaround? As far as I
> know, move and copy are not common operations, do we need to optimize for
> it?
>

Probably the biggest issue here is 
https://issues.apache.org/jira/browse/OAK-464

Michael

Re: Handling copies and moves with tree diffs

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>1) For each tree node we keep track of its original location.

> However, I think recording the path is not sufficient. We need to record
>the parent node.

I guess you don't mean remembering the original path of the parent?
keeping track of the original location of the parent (let's say "/test")
instead of keeping track of the original location of the node (let's say
"/test/node") is just keeping less information, and can't be better I
guess.


>2) We don't need optimal handling of all possible corner cases.

As long as it always works correctly that's fine. But proving that a diff
based solution always works is quite tricky in my view. I'm a bit afraid
that we build a solution that is hard to understand and maintain, tricky
to get right _and_ fast at the same time.

> Currently we work around this problem by mapping each
> higher-level copy and move operation to separate branch commits that
> contain just that operation, which isn't too ideal.

Could you tell me what problems you see with this workaround? As far as I
know, move and copy are not common operations, do we need to optimize for
it?

> in oak-core we only keep track of tree states instead of change logs ...
> This works pretty well in general and avoids a lot of extra complexity


I have to admit I didn't recently read the source code of this area, but
conceptually I have the feeling that keeping track of the change log is
actually quite simple (keeping an simple list of change operations), and
trying to build the change log from the state sounds like a lot more
complex to me. What is in your view the extra complexity in keeping a list
of change operations?

Regards,
Thomas