You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@alice-dsl.de> on 2012/04/24 22:51:55 UTC

Two-phase merge

Hi all,

This one is basically the technical post promised in my
"Merge Policies" post. The proposal I make here is not
an alternative to e.g. "Symmetric Merge" but rather a
refined merge strategy that would also benefit the current
merge.

Today, the merge algorithm will do the following for a
merge from A -> B:

* read / evaluate the merge info and change history
   to find the list of revisions to merge
* fragment the list of revisions where they are non-
   contiguous; fragment = a revision range to merge
* iteratively merge all fragments from A to B,
   optionally resolve conflicts after each fragment

In end, we will always merge individual file or property
changes but we plan the merge on tree-level. For many
file nodes, this will unnecessary intermediate states
where the merge got fragmented due to a change to
some *other* file. This increases the probability of
merge conflicts and their likelihood to show up again
after each conflict resolution. Moreover, it masks cases
where Symmetric Merge might select an optimized
merge plan (see wiki for various examples).

I propose to use the following two-phase merge workflow:

(1) Planning phase
- read / evaluate the merge info and change history
   to find the list of revisions to merge
- break that down for each node changed on the source side
- optionally handle renaming etc.
- mark tree conflicts (not sure when to signal them to the
   user and when to resolve them; high-level fragmentation
   might become necessary)

(2) Execution phase
- merge changes on a per-node basis, i.e. all changes of
   that node get merged before the next node
- fragmentation may still be necessary for *that file*
- given the "merge plan" for a node, alternative merge
   strategies may be chosen

That's all I currently have on merge and is not very
very long-term ;)

-- Stefan^2.

Re: Two-phase merge

Posted by Julian Foad <ju...@btopenworld.com>.

On 29 April, Stefan Fuhrmann wrote:
> Julian Foad wrote:
>> Stefan Fuhrmann wrote:
>>> [...] The proposal I make here is not
>>> an alternative to e.g. "Symmetric Merge" but rather a
>>> refined merge strategy that would also benefit the current
>>> merge.
[...]
>>> In end, we will always merge individual file or property
>>> changes but we plan the merge on tree-level. For many
>>> file nodes, this will unnecessary intermediate states
>>> where the merge got fragmented due to a change to
>>> some *other* file.
>> 
>> Yes.  That is likely to be a significant cause of extra conflict in cases
>> where the user has sub-tree mergeinfo or has done some cherry-pick merges
>> and is now doing a "merge everything else" merge.
> 
> It also happens in the "back and forth" case:


If this problem does arise in basic back-and-forth merging then I would agree it's much more important than I thought, 
but I don't get it.  If "it" 
refers to fragmenting the merge, then no, each merge in a back-and-forth case doesn't get fragmented.

> if the changes in the source are interspersed
> with merges from the target, i.e. the merges
> from A to B will be excluded from the changes
> to be merged from B to A.

Do you mean like this?

            A2      A3      A4
  A o-------o-------o-------o

   /         \             /

  o           \           /

   \           \         /

  B o---o-------o-------o

        B2      B3      B4   


The merge we're considering is the one shown from B4 to A4.  Changes B2 and B4 are changes in the source, interspersed with B3 which is a merge from the target, that is the result of a merge from A to B.

A2 will be chosen as the base for a 3-way merge, and the two arms of this merge will be (A2:B4) as the source side and (A2:A4) as the target side.  The source side, (A2:B4), is equivalent to the combination of B2 and B4.  It can be a bit tricky to understand why it is equivalent; if so, see <>.  The point is that changes B2 and B4 don't get split up and merged one after the other in this scenario.



>>>  This increases the probability of
>>>  merge conflicts and their likelihood to show up again
>>>  after each conflict resolution.
>> (That is: after each phase of the N revision-range phases that the merge 
>> got broken into.)
>> 
>>>  Moreover, it masks cases
>>>  where Symmetric Merge might select an optimized
>>>  merge plan (see wiki for various examples).
>>  I understand, but I don't know of any examples on the Wiki yet.
> 
> E.g. in a merge from A to B, we can simply take the merge
> result of the last B to A merge if there were no later changes
> on either side and the B -> A merge did not cherry-pick
> (with respect to the node in question).

>>>  I propose to use the following two-phase merge workflow:
>>> 
>>>  (1) Planning phase
>>>  - read / evaluate the merge info and change history
>>>     to find the list of revisions to merge
>>>  - break that down for each node changed on the source side
>>>  - optionally handle renaming etc.
>>>  - mark tree conflicts (not sure when to signal them to the
>>>     user and when to resolve them; high-level fragmentation
>>>     might become necessary)
>>> 
>>>  (2) Execution phase
>>>  - merge changes on a per-node basis, i.e. all changes of
>>>     that node get merged before the next node
>>>  - fragmentation may still be necessary for *that file*
>>>  - given the "merge plan" for a node, alternative merge
>>>     strategies may be chosen
>> 
>> OK, the significant change here is not so much the two-phase (although that 
>> has merit in itself) but rather turning the processing order inside-out from 
>> (conceptually)
>> 
>>     REV_RANGES = [contiguous-revision-ranges needed for TREE]
>>     for REV_RANGE in REV_RANGES:
>>       3-way merge REV_RANGE into the whole TREE
>> 
>> to (conceptually)
>> 
>>     for each NODE in the TREE:
>>       MERGES = [find the best sequence of 3-way merges to merge all needed 
>> changes into NODE]
>>       for MERGE in MERGES:
>>         3-way-merge MERGE into NODE
> 
> Correctly. With the tiny addition that all rename
> and tree conflict handling is done in the planning
> phase. That makes it repeatable and potentially
> reduces the number of conflicts (at most one per
> node instead of one per node and merge step).

I get how it reduces conflicts.  What do you mean by 'repeatable'?

> The second phase is also reduced to purely text-
> based processing because all the "structure"
> issues have already been resolved. But that might
> be more of a mental aid instead of a true reduction
> in complexity (we still might need to handle missing
> targets on disk etc.).


>> This is a good idea in itself.  It certainly brings advantages as
>> mentioned above.  I'm not sure what, if any, disadvantages it might have.
> 
> Memory consumption could be proportional to
> the number of changes and renames or such.
> Should not be a real-world problem as long as
> you don't merge more than a few 100k changes.
> 
> A more severe issue might be latency because
> the planning phase needs to gather / evaluate
> data from large sections of history. But that one
> can be improved ;)

Right.

>> I don't know that the scenarios this helps with are our main concern at
>> this point.  It would be good to keep this approach in mind when writing
>> or  designing any new merge code.  In terms of development effort I think
>> it's  more valuable to concentrate first on the advantages to to-and-fro
>> merging (most  commonly without subtree merges and without cherry-picks)
>> that the  'symmetric merge' brings.
> 
> I think it can become helpful as soon as we track renames because
> we need to remap on a per-node basis.

To track a renamed node, we need to be able to recognize and act on the information that each node in a merge will be found at three paths that may all be different: its path in the merge-base (aka source-left) tree, its path in the source (aka source-right) tree, and its path in the target tree.  Is that what you mean by "remap on a per-node basis"?  I think that's something we will need to do anyway, but I don't see that it makes much difference whether we do this once at the planning stage (for a two-phase merge), or do it separately for each sub-merge (for the split-into-revision-ranges method).

- Julian

Re: Two-phase merge

Posted by Stefan Fuhrmann <eq...@web.de>.

Julian Foad wrote:
> Stefan Fuhrmann wrote:
>
>> This one is basically the technical post promised in my
>> "Merge Policies" post. The proposal I make here is not
>> an alternative to e.g. "Symmetric Merge" but rather a
>> refined merge strategy that would also benefit the current
>> merge.
>>
>> Today, the merge algorithm will do the following for a
>> merge from A ->  B:
>>
>> * read / evaluate the merge info and change history
>>    to find the list of revisions to merge
>> * fragment the list of revisions where they are non-
>>    contiguous; fragment = a revision range to merge
>> * iteratively merge all fragments from A to B,
>>    optionally resolve conflicts after each fragment
>>
>> In end, we will always merge individual file or property
>> changes but we plan the merge on tree-level. For many
>> file nodes, this will unnecessary intermediate states
>> where the merge got fragmented due to a change to
>> some *other* file.
> Yes.  That is likely to be a significant cause of extra conflict in cases where the user has sub-tree mergeinfo or has done some cherry-pick merges and is now doing a "merge everything else" merge.

It also happens in the "back and forth" case:
if the changes in the source are interspersed
with merges from the target, i.e. the merges
from A to B will be excluded from the changes
to be merged from B to A.
>> This increases the probability of
>> merge conflicts and their likelihood to show up again
>> after each conflict resolution.
> (That is: after each phase of the N revision-range phases that the merge got broken into.)
>
>> Moreover, it masks cases
>> where Symmetric Merge might select an optimized
>> merge plan (see wiki for various examples).
> I understand, but I don't know of any examples on the Wiki yet.

E.g. in a merge from A to B, we can simply take the merge
result of the last B to A merge if there were no later changes
on either side and the B -> A merge did not cherry-pick
(with respect to the node in question).
>> I propose to use the following two-phase merge workflow:
>>
>> (1) Planning phase
>> - read / evaluate the merge info and change history
>>    to find the list of revisions to merge
>> - break that down for each node changed on the source side
>> - optionally handle renaming etc.
>> - mark tree conflicts (not sure when to signal them to the
>>    user and when to resolve them; high-level fragmentation
>>    might become necessary)
>>
>> (2) Execution phase
>> - merge changes on a per-node basis, i.e. all changes of
>>    that node get merged before the next node
>> - fragmentation may still be necessary for *that file*
>> - given the "merge plan" for a node, alternative merge
>>    strategies may be chosen
> OK, the significant change here is not so much the two-phase (although that has merit in itself) but rather turning the processing order inside-out from (conceptually)
>
>    REV_RANGES = [contiguous-revision-ranges needed for TREE]
>    for REV_RANGE in REV_RANGES:
>      3-way merge REV_RANGE into the whole TREE
>
> to (conceptually)
>
>    for each NODE in the TREE:
>      MERGES = [find the best sequence of 3-way merges to merge all needed changes into NODE]
>      for MERGE in MERGES:
>        3-way-merge MERGE into NODE

Correctly. With the tiny addition that all rename
and tree conflict handling is done in the planning
phase. That makes it repeatable and potentially
reduces the number of conflicts (at most one per
node instead of one per node and merge step).

The second phase is also reduced to purely text-
based processing because all the "structure"
issues have already been resolved. But that might
be more of a mental aid instead of a true reduction
in complexity (we still might need to handle missing
targets on disk etc.).
> This is a good idea in itself.  It certainly brings advantages as mentioned above.  I'm not sure what, if any, disadvantages it might have.

Memory consumption could be proportional to
the number of changes and renames or such.
Should not be a real-world problem as long as
you don't merge more than a few 100k changes.

A more severe issue might be latency because
the planning phase needs to gather / evaluate
data from large sections of history. But that one
can be improved ;)
> I don't know that the scenarios this helps with are our main concern at this point.  It would be good to keep this approach in mind when writing or designing any new merge code.  In terms of development effort I think it's more valuable to concentrate first on the advantages to to-and-fro merging (most commonly without subtree merges and without cherry-picks) that the 'symmetric merge' brings.

I think it can become helpful as soon as we
track renames because we need to remap
on a per-node basis.

-- Stefan^2.

Re: Two-phase merge

Posted by Julian Foad <ju...@btopenworld.com>.

Stefan Fuhrmann wrote:

> This one is basically the technical post promised in my
> "Merge Policies" post. The proposal I make here is not
> an alternative to e.g. "Symmetric Merge" but rather a
> refined merge strategy that would also benefit the current
> merge.
> 
> Today, the merge algorithm will do the following for a
> merge from A -> B:
> 
> * read / evaluate the merge info and change history
>   to find the list of revisions to merge
> * fragment the list of revisions where they are non-
>   contiguous; fragment = a revision range to merge
> * iteratively merge all fragments from A to B,
>   optionally resolve conflicts after each fragment
> 
> In end, we will always merge individual file or property
> changes but we plan the merge on tree-level. For many
> file nodes, this will unnecessary intermediate states
> where the merge got fragmented due to a change to
> some *other* file.

Yes.  That is likely to be a significant cause of extra conflict in cases where the user has sub-tree mergeinfo or has done some cherry-pick merges and is now doing a "merge everything else" merge.

> This increases the probability of
> merge conflicts and their likelihood to show up again
> after each conflict resolution.

(That is: after each phase of the N revision-range phases that the merge got broken into.)

> Moreover, it masks cases
> where Symmetric Merge might select an optimized
> merge plan (see wiki for various examples).

I understand, but I don't know of any examples on the Wiki yet.

> I propose to use the following two-phase merge workflow:
> 
> (1) Planning phase
> - read / evaluate the merge info and change history
>   to find the list of revisions to merge
> - break that down for each node changed on the source side
> - optionally handle renaming etc.
> - mark tree conflicts (not sure when to signal them to the
>   user and when to resolve them; high-level fragmentation
>   might become necessary)
> 
> (2) Execution phase
> - merge changes on a per-node basis, i.e. all changes of
>   that node get merged before the next node
> - fragmentation may still be necessary for *that file*
> - given the "merge plan" for a node, alternative merge
>   strategies may be chosen

OK, the significant change here is not so much the two-phase (although that has merit in itself) but rather turning the processing order inside-out from (conceptually)

  REV_RANGES = [contiguous-revision-ranges needed for TREE]
  for REV_RANGE in REV_RANGES:
    3-way merge REV_RANGE into the whole TREE

to (conceptually)

  for each NODE in the TREE:
    MERGES = [find the best sequence of 3-way merges to merge all needed changes into NODE]
    for MERGE in MERGES:
      3-way-merge MERGE into NODE

This is a good idea in itself.  It certainly brings advantages as mentioned above.  I'm not sure what, if any, disadvantages it might have.

I don't know that the scenarios this helps with are our main concern at this point.  It would be good to keep this approach in mind when writing or designing any new merge code.  In terms of development effort I think it's more valuable to concentrate first on the advantages to to-and-fro merging (most commonly without subtree merges and without cherry-picks) that the 'symmetric merge' brings.

- Julian