You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Michael Haggerty <mh...@alum.mit.edu> on 2006/10/25 22:55:33 UTC

[merge tracking] Proposal to track "novel changes"

[This topic was discussed with dlr yesterday on IRC; see

    http://svn.haxx.se/dev/archive-2006-10/0697.shtml

for a transcript.]

I'm no merge guru, so this could be completely silly (or well-known
technology).  Nor have I hacked on the internals of svn, so it could be
totally impractical to implement (I would therefore greatly appreciate
feedback.)  And not being an svn hacker, I unfortunately can't "STFU and
write some code" in any timely manner.  But here we go...


I think that the merge-tracking scheme that is used by svnmerge.py and
by the proposed merge-tracking extension to SVN will probably work
pretty well for simple merge topologies.  But if there are any loops in
the undirected graph of branches between which revisions have been
merged, then it will break down.

For example, if I merge a change from branch B to branch A, and the same
change from branch B to branch C, then there is nothing to stop me
merging the same change from branch A to branch C, resulting in a
repeated merge and a conflicted or even bogus result.

    B
   /|
  A |
   \|
    C

Only in the case that the deltas are identical for B->A and B->C will
the merge code recognize that A->C is a noop.  This problem is intrinsic
to merge tracking the way that it has been implemented in svnmerge.py.
Usually it is not noticed because people use a very simple merge
topology (out of fear?).


It seems to me that what one wants to track in a merge-tracking system
is the creation and motion of "novel changes".  [On IRC I referred to
these as "original revisions", but I think the new name is less
ambiguous.]  I define a novel change to be one created by a developer by
the application of brainpower and emacs (as opposed to a change
resulting from a merge):

    novel change == a change made from scratch (not via a merge)

Most of the intellectual content of a project is in the novel changes,
whereas (I'm exaggerating here) merges are mostly a matter of
bookkeeping and tidying up.

As far as an SCM is concerned, a novel change is an indivisible nugget
of functionality, initially embodied as a single changeset applied to a
particular parent revision.  The proposal is this:

1. For each commit, record the novel change(s) that it involved.

   a. If the commit did not involve a merge, then it is by definition a
novel change.

   b. If a commit resulted from merging a set of revisions from another
branch, then the novel change set associated with the new commit is the
union of the novel changes associated with each of the merged-in
revisions.  Such a commit is *not* itself considered a novel change,
even if the merge involved the manual repair of merge conflicts.

2. Each version of a branch has a set of accumulated novel changes
consisting of the union of all novel changes associated with all earlier
commits.

3. A novel change is never merged into a branch if it is already in the
branch's set of accumulated novel changes.  This restriction also
applies if the novel change is merged indirectly, for example by merging
a revision that itself included the same novel change.

4. If a file is copied, its accumulated-novel-changes are copied as well.

For example, consider the following sequence of events:

1. A novel change is made on branch B, creating revision B:3.  Resulting
metadata:

    novel-changes[B:3] = {B:3}
    accumulated-novel-changes[B:3] = {B:1,B:2,B:3} = {B:1-3}

2. B:3 is merged to branch A, creating A:5.  Resulting metadata:

    novel-changes[A:5] = {B:3}
    accumulated-novel-changes[A:5] = {A:1-4,B:3}

3. B:3 is independently merged to branch C, creating C:10.  Resulting
metadata:

    novel-changes[C:10] = {B:3}
    accumulated-novel-changes[C:10] = {C:1-9,B:3}

The deltas associated with B:3, A:5, and C:10 differ textually, but they
are nevertheless conceptually equivalent, because they are different
representations of the same novel change.

Therefore, if somebody attempts to merge A:5 into branch C, the merge
code should treat it as a noop, because the novel change embodied in A:5
(namely B:3) is already in the accumulated-novel-changes of branch C.
This determination should be made based only on the merge-tracking
metadata, without looking at the diffs at all, and it should not be
considered a conflict (though an informational message would be helpful).

Suppose somebody now wants to add the same novel change into branch D,
creating D:13.  She has the choice of whether to merge B:3, A:5, or
C:10.  As far as the merge-tracking system is concerned, it makes no
difference because they are just three representations of the same novel
change (namely the one originally made in B:3).  But of course the three
diffs are textually different, so the developer should choose which
representation to merge based on which one is most likely to apply
without conflicts (presumably the one that is on the branch that is most
similar to branch D).  Regardless of which branch was used as the source
of the merge, the resulting metadata would be

    novel-changes[D:13] = {B:3}.
    accumulated-novel-changes[D:13] = {D:1-12,B:3}.


This scheme generalizes to merges of multiple revisions at once.  If
B:10-12 are novel changes and they are merged into branch A, creating
A:15, then

    novel-changes[A:15] = {B:10-12}
    accumulated-novel-changes[A:15] = {A:1-4,B:3,A:6-14,B:10-12}.

This scheme can result in merge conflicts that would not be recognized
by other schemes.  For example, suppose that I have also merged B:10
into branch C creating C:18 with metadata

    novel-changes[C:18] = {B:10}
    accumulated-novel-changes[C:18] = {C:1-9,B:3,C:11-17,B:10}

Now an attempt to merge A:15 into C would result in a conflict because
A:15 represents {B:10-12}, but B:10 is already contained in the
accumulated-novel-changes of branch C.  (There is no way to know how to
apply only part of revision A:15.)

However, a merge of B:10-12 into C would be OK.  B:10 would be skipped
over (or not even listed in the "available-to-merge" list) because it is
already contained in C.

If a revision is reverted (by merging it backwards), then the associated
novel changes are "negated".  Thus if B:11 is reverted out of branch A,
creating A:20, then

    novel-changes[A:20] = {-B:11}
    accumulated-novel-changes[A:20] = {A:1-4,B:3,A:6-14,B:10,B:12,A:16-19}.

An attempt to revert a novel change that was not contained in the
branch's accumulated-novel-changes is a conflict.  Therefore, whereas
novel-changes can be negative, accumulated-novel-changes can only
contain "positive" novel changes.


Additional details:

- Of course the merge-tracking metadata would conceptually be recorded
at file rather than a branch resolution, though physically there should
be an inheritance system to prevent an explosion of metadata.

- In these examples I've named novel changes by the branch:revision in
which they first occurred.  In reality, any unique id would work.  In
fact, to support merging across repositories one could incorporate the
repository's UUID in the novel change IDs.

- There should be a way to manually adjust merge-tracking data (a la
"svnmerge.py merge --record-only") and to override the strict default
behavior.

- It would be nice to include the "svnmerge.py block" feature.

- One might want to record the literal source of a merge in addition to
the novel-changes that were involved.  (In other words, keep track of
whether the user above merged B:3 vs. A:5 vs. C:10.)


Advantages of this scheme:

- The merge-tracking metadata deduces the conceptual intention of a
programmers activities, not just the resulting textual changes, with no
extra effort by the programmer.

- Reverts are supported in a straightforward way.

- Double merges, incorrect reverts, etc. are prevented based on the
merge-tracking metadata rather than by fancy merge algorithms.

- Requires roughly the same amount of metadata as the svnmerge.py
scheme, though efficiency might require some database denormalization
resulting in more voluminous metadata.  I *think* that a clever script
could infer the novel-change data from svnmerge.py merge-tracking data,
provided said data were self-consistent.

- I think it could be implemented using svn properties, though some
changes to the property system would help.

- Unknown amount of work to implement.


Disadvantages of this scheme:

- Requires some programmer discipline, for example not to include
unrelated novel changes within a merge commit.

- Some of the behavior is rather implicit and might seem confusing (even
though I think the default behavior is consistent with what people want
to do 98% of the time)

- ???


What do people think of this?

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [merge tracking] Proposal to track "novel changes"

Posted by Daniel Berlin <db...@dberlin.org>.
On 10/25/06, David James <dj...@collab.net> wrote:
> On 10/25/06, Michael Haggerty <mh...@alum.mit.edu> wrote:
> > What do people think of this?
>
> Hi Michael,
>
> Currently, svnmerge.py tracks direct merge information, but it does
> not keep track of indirect merge information. If you merge from A->B,
> and then from B->C, svnmerge won't realise that changes from A were
> merged to C.
>
> Your proposal to track "novel changes" is well thought out. I think
> that this proposal will solve our problems with indirect merges.

>
> In your proposal, you track novel changes on a per-branch basis. I
> think that we should take your proposal a step further and track novel
> changes on a per-file basis.
>
> If I make changes to two files in a single commit (let's say I changed
> README.txt, and INSTALL.txt), I have made two novel changes. If I
> merge (or copy) the trunk README.txt onto a branch, the branch will
> have my README.txt change, but it won't have my INSTALL.txt change. We
> need to keep track of this information.

You need to be careful how far you go, or else you will get into an
insane amount of bookkeeping and comparison.

>
> In February, I made a very similar proposal, which assigned
> "change_id" numbers to each novel change to each file. When files are
> copied (or merged), these "change_id" numbers are copied, too. You can
> find my old proposal at
> http://svn.haxx.se/dev/archive-2006-02/0701.shtml

Doing this very efficiently requires FS changes, and even more RA/WC
changes (you'd have to track the "isresultofmerge" on a per-file basis
in the WC).  You are going to have to change every
add/delete/merge/apply_textdelta FS operation to have a flag saying
whether it is the result of a merge or not, so you know whether it is
part of a novel change or not.

None of this is impossible, it's just a large amount of work.
It also requires either some pretty hairy db schemas, or built-in fs support.

This is one reason I started looking at revlog format, which
explicitly supports per-file parents (and could be trivially extended
to differentiate merge-parents from non-merge parents, and then it
would be explicitly representing the information you are looking for.

You'd then be able to simply walk up the file tree to get the "novel
change" information for a single file.

I think it's great that you guys want to do all this.  Really.  One of
the original design goals was to have something one or two people
could implement for 1.5, and get us the functionality of svnmerge.py.
 Unless people are willing to commit to helping with these schemes,
they just aren't time feasible for a single minor version, IMHO.

It's very easy to come up with better designs than what we've got.  It
was not built to support complex merges, just the very simple merge
back and forth between a  branch and a trunk.    As I said on IRC, I
had a design like Michael's, but threw it out because it was too much
work for a single minor version time frame.  If ya'll want to use
Michael's instead of what we got now, I ain't gonna take any offense,
but i'm not sure it's the right answer for 1.5.
I'd be more than happy to be proven wrong of course.

--Dan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [merge tracking] Proposal to track "novel changes"

Posted by David James <dj...@collab.net>.
On 10/25/06, Michael Haggerty <mh...@alum.mit.edu> wrote:
> What do people think of this?

Hi Michael,

Currently, svnmerge.py tracks direct merge information, but it does
not keep track of indirect merge information. If you merge from A->B,
and then from B->C, svnmerge won't realise that changes from A were
merged to C.

Your proposal to track "novel changes" is well thought out. I think
that this proposal will solve our problems with indirect merges.

In your proposal, you track novel changes on a per-branch basis. I
think that we should take your proposal a step further and track novel
changes on a per-file basis.

If I make changes to two files in a single commit (let's say I changed
README.txt, and INSTALL.txt), I have made two novel changes. If I
merge (or copy) the trunk README.txt onto a branch, the branch will
have my README.txt change, but it won't have my INSTALL.txt change. We
need to keep track of this information.

In February, I made a very similar proposal, which assigned
"change_id" numbers to each novel change to each file. When files are
copied (or merged), these "change_id" numbers are copied, too. You can
find my old proposal at
http://svn.haxx.se/dev/archive-2006-02/0701.shtml

My old proposal is a bit inefficient for large repositories, because
it keeps individual change_ids which are associated with every copy of
every file. When you copy a file, the merge tracking information needs
to be copied, too. Unfortunately, my old proposal does not support
lazy copies, so some space (and time) is wasted when you branch. (The
current merge-tracking branch does not support lazy copies of merge
tracking metadata, either, so it is similarly inefficient.)

To keep Subversion performing well when you have lots of branches,
you'll need to support lazy copies of merge-tracking metadata. I've
already outlined a proposal for doing this. See
http://svn.haxx.se/dev/archive-2006-05/0061.shtml

Cheers,

David

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [merge tracking] Proposal to track "novel changes"

Posted by Daniel Berlin <db...@dberlin.org>.
> 
> I think that the merge-tracking scheme that is used by svnmerge.py and
> by the proposed merge-tracking extension to SVN will probably work
> pretty well for simple merge topologies.  But if there are any loops in
> the undirected graph of branches between which revisions have been
> merged, then it will break down.
>
There is no need to say this is an opinion (using 'think').
This is a fact :)
> Only in the case that the deltas are identical for B->A and B->C will
> the merge code recognize that A->C is a noop.  This problem is intrinsic
> to merge tracking the way that it has been implemented in svnmerge.py.
> Usually it is not noticed because people use a very simple merge
> topology (out of fear?).

Right. The fact that most people do simple merges is one reason why i
chose the scheme.  However, the main one was that it could be
implemented fairly easily, and would provide some level of merge
tracking without making it impossible to do more later.
>
>
> It seems to me that what one wants to track in a merge-tracking system
> is the creation and motion of "novel changes".

>  [On IRC I referred to
> these as "original revisions", but I think the new name is less
> ambiguous.]  I define a novel change to be one created by a developer by
> the application of brainpower and emacs (as opposed to a change
> resulting from a merge):
>
>     novel change == a change made from scratch (not via a merge)
>
> Most of the intellectual content of a project is in the novel changes,
> whereas (I'm exaggerating here) merges are mostly a matter of
> bookkeeping and tidying up.
Sure. This is no different than any other history sensitive merge.
Your goal is to avoid mergeing any change that already exists in your
code.  Except for hand patching (which you can't detect without
ridiculously sophisticated bookkeeping), all changes already existing
in your code come from merges, so all you really need to keep track of
is which *non-merge* changes exist, and merge those.

I am positive using such a scheme would give you better information,
and had i thought it could be done in 3-6 months, I would have :)

The problem with it is that it requires teaching the WC layer, RA
layer, and FS layer to be able to communicate which files were changed
due to merge, and which weren't (I firmly believe trying to track at a
level more than that, which would be required

This is non-trivial, and not something i wanted to attempt for 1.5,
when the original goal was to get something that can do what
svnmerge.py does, and maybe a little more.

I don't see why you couldn't implement something like this for 1.6/1.7
or 2.0 (depending on the changes required), and I disagree with your
irc statement that doing what the merge-tracking branch does somehow
makes it harder to move to such a scheme later than it already is.

Unless you're going to implement something to walk the FS history and
heuristically discover which changes were made through merges, and
which weren't, nothing will give you the info you want.

At least the current merge tracking scheme would tell you where each
merge occurred, so you could figure this info out.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org