Posted to dev@subversion.apache.org by Daniel Berlin <db...@dberlin.org> on 2006/04/28 16:34:04 UTC

Merge tracking proposal

Among other things I am working on at Google, I have been tasked
full-time with implementing merge tracking.

As part of this, I have come up with a design I plan on implementing for
tracking what revisions have been merged where, in a manner that is
suitable for use by various other operations (history sensitive merging,
etc).

In doing so, I reviewed the use cases that were kindly written up, and
believe that most if not all of them can be accomplished with this
design.

Please remember that this design is *only* for tracking what changes are
merged where.  I expect this to be the easy part, compared to deciding
exactly what algorithms our history sensitive merge uses, and how it
proceeds.

I have divided the design into four portions "Goals", "information
storage", "information updating", "other prereqs to being able to
implement the design".

The "random questions and answers" section is there to answer common
questions other developers I've talked to while coming up with this
design have had, in the hopes that it will answer some common queries
the list may have.

Goals:

The overarching goal here is to track the revision numbers being merged
by a merge operation, and to keep this information in the right places
as various operations (copy, delete, add, etc) are performed.

The goals of this design are:
1. To be able to track this down to individual files in a working copy,
and to be able to determine which files have had which revisions merged
into them.

2. To not need to contact the server more than we already do now to
determine which revisions have been merged in a file or directory (ie
some contact is acceptable, asking the server about each file is not).

3. To be able to edit merge information in a human editable form.

4. For the information to be stored in a space efficient manner, and to
be able to determine the revisions merged into a given file/directory in
a time efficient manner.

5. Still getting a conservatively correct answer (not worse than what we
have now) when no merge info is specified.

6. To be able to collect, transmit, and keep this information up to date
as much as possible on the client side.

7. To be able to index this information in the future in order to
answer queries.

Specific Non-goals for *this design* include:
1. Doing actual history sensitive merging
2. Curing cancer (aka being all things to all people)

When reading the design presented here, please remember that it is
impossible to get something perfect in subversion on the first try, and
attempting to nit pick this to death will not actually help anything,
but it would be very annoying.  This is not to dissuade people from
suggesting design changes, but if you plan on suggesting a different
revision list format because you believe colon doesn't have a good level
of synergy with existing separators, or something, you may want to
rethink whether it really matters.

Some pre-notes:
The one argument i continually have with myself is whether to store info
in revprops, or just on dirs and files. If you want to try to
convincingly argue one way or the other, go for it. Certainly, I think
it makes certain semantics clearer on what operations do below and how
to proceed easier, the question is whether it is efficient enough time
wise when we go to retrieve merge info, and whether it complicates what
merge has to do too much.  It also removes all of the listed
pre-reqs :).

One could also try to argue that we should start with exactly the same
cases svnmerge does (IE only allow merge info at the wc roots, only
store it on that directory, etc), with a nicer integrated interface, and
try to expand it from there. I am open to such an argument as well. :)

Anyway, on with the design.


Information storage

The first question that many people ask is "where should we store the
merge information" (what we store will be covered next).

After a large amount of research, the design I have come up with is
this:
A merge info property, named SVN_MERGE_PROPERTY (not the real name, I
have made it a constant so we can have a large bikeshed about what to
really call it) stored in the revision properties, directory properties,
and file properties.
Each will store the *full, complete* list of current merged in changes,
as far as it knows.  This ensures that the merge algorithm and other
consumers do not have to walk back revisions in order to get the
transitive closure of the revision list.

The way we choose which of file, dir, revprop merge info to use in case
of conflicts is a simple system of inheritance[1] where the "most specific"
place wins.  This means that if the property is set on a file, that
completely overrides the directory and revision level properties.
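
A minimal sketch of the "most specific place wins" lookup, in Python.
The function name, the dict standing in for file and directory
properties, and the choice to walk up through every ancestor directory
before falling back to the revprop are all assumptions made for
illustration; the proposal does not define an API.

```python
import posixpath


def effective_merge_info(path, props, revprop_info=None):
    """Return the merge info that applies to `path`.

    `props` maps a versioned path to its merge info value (a hypothetical
    in-memory stand-in for file and directory properties).  `revprop_info`
    stands in for the revision-level property.
    """
    current = path
    while True:
        if current in props:
            # The most specific setting wins outright; nothing is combined.
            return props[current]
        parent = posixpath.dirname(current)
        if parent == current:
            # Reached "/" or "" without finding a property.
            return revprop_info
        current = parent
```

With props set on a/branches/bar and a/branches/bar/foo, a file under
foo resolves to foo's value, and a path with no property on any ancestor
falls back to the revprop value.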

The way we choose which to store to depends on how much and where you
merge, and will be covered in the semantics.

The reasoning for this system is to avoid having to either copy info
everywhere, or crawl everywhere, in order to determine which revisions
have been applied.  At the same time, we want to be space and time
efficient, so we can't just store the entire revision list everywhere.

As for what is stored:

For the large number of people i have talked to and heard about from
others, it seems the human editable *format* of how svnmerge stores
merge information (IE pathname and list of revisions) is fine.  Binary
storage of such information would buy, on average, a 2-3 byte decrease
per revision/range in size over ascii[2], while making it not directly
human editable.

As such, i have chosen to represent the revisions we have merged *into*
something as a path, a colon, and then a comma separated revision list,
containing one or more revisions or revision ranges.  Revision range end
and beginning points are separated by "-".

So the grammar looks something like this

revisionrange -> REVISION "-" REVISION

revisionelement -> revisionrange | REVISION

revisionlist -> revisionelement (COMMA revisionelement)*

revisionline -> PATHNAME COLON revisionlist

top -> revisionline (NEWLINE revisionline)*
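
As an illustration of the grammar, here is a small Python parser sketch.
The function names are hypothetical, and splitting each line on its last
colon (so that a path containing a colon still parses) is an assumption
the grammar itself does not spell out.

```python
import re

# One revision element: either REVISION or REVISION "-" REVISION.
_ELEMENT = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_revision_list(text):
    """Parse "1-4,7,9-12" into a list of inclusive (start, end) tuples."""
    ranges = []
    for element in text.split(","):
        match = _ELEMENT.match(element.strip())
        if match is None:
            raise ValueError("bad revision element: %r" % element)
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        if end < start:
            raise ValueError("backwards range: %r" % element)
        ranges.append((start, end))
    return ranges


def parse_merge_info(text):
    """Parse newline-separated "PATHNAME:revisionlist" lines into a dict."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last colon, since the revision list never has one.
        path, _, revs = line.rpartition(":")
        if not path:
            raise ValueError("missing path: %r" % line)
        info[path] = parse_revision_list(revs)
    return info
```

For example, parse_merge_info("/trunk:1-9") yields a dict with the single
key "/trunk" mapped to the one range (1, 9).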

This list will *not* be stored in a canonicalized minimal form for a
path (IE it may contain single revision numbers that could be ranges).
This is chiefly because the benefit of such a canonical format (slightly
easier *comparison*, but not indexing) is heavily outweighed by the fact
that generating a canonical form may require groveling through a lot of
information to determine what that minimal canonical form is.  In
particular, it may be that the revision list "5,7,9" is, in minimal
canonical form, "5-9", because 6 and 8 do not have any effect on the
pathname that 5 and 9 are from.
Canonicalization could be done as a server side post pass because the
information is stored in properties.
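
To make the "5,7,9" example concrete, a rough Python sketch of such a
canonicalization pass follows.  The affects_path callback is hypothetical:
it stands for whatever lookup (for instance, log information) says whether
a revision touched the merge source path, and that lookup is exactly the
"groveling" cost mentioned above.

```python
def canonicalize(revisions, affects_path):
    """Collapse merged revisions into a minimal comma separated list.

    `affects_path(rev)` is a hypothetical callback that returns True when
    `rev` changed the merge source path.  A gap made up only of revisions
    that did not change the path is absorbed into the surrounding range.
    """
    ranges = []
    for rev in sorted(set(revisions)):
        if ranges:
            start, end = ranges[-1]
            gap = range(end + 1, rev)
            if not any(affects_path(r) for r in gap):
                ranges[-1] = (start, rev)
                continue
        ranges.append((rev, rev))
    return ",".join(
        str(s) if s == e else "%d-%d" % (s, e) for s, e in ranges
    )


# Revisions 6 and 8 did not touch the source path, so the gaps close up.
print(canonicalize([5, 7, 9], lambda rev: rev in {5, 7, 9}))  # 5-9
```

If every revision in the gaps did affect the path, the same call returns
"5,7,9" unchanged.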

Note that this revision format will not scale on its own if you have a
list of a million revisions.  No format will, easily.  However, because it is
stored in properties, one can change the wc and fs backends to simply do
something different with this single property if they wanted to.
Given the rates of change of various very active repositories, this will
not be a problem we need to solve for many many years.

Information updating:
Each operation you can perform may update or copy the merge info
associated with a path, file, or revision.


svn add:  No change to merge info
svn delete: No direct change to merge info (indirectly, because the
props go away, so does the merge info for the file)
svn rename: No change to merge info
svn copy: Copies the merge info from the source path to the destination
path, if any.

This includes copying info from revprops, if necessary, by determining
if the merge info exists in a revprop for the last changed commit for
the source path, and copying it to the new revprop if it does (someone
probably needs to check if this is the right semantic :P)

All copies are full-copies of the merge information.

svn merge: Adds or subtracts to the merge info, according to the
following:

Where to put the info:
1. If the merge target is a single file, the merge info goes to the
property SVN_MERGE_INFO set on that file.
2. If the merge target is a non-wc-root directory, the merge info goes
to the property SVN_MERGE_INFO set on the directory
3. If the merge target is a wc-root directory, the merge info goes to
the property SVN_MERGE_INFO set on the revprop.

What info is put:
1. If you are merging in reverse, revisions are subtracted from the
revision lines, but we never write out anti-revisions.  Thus, if you
subtract all the merged revisions, you just get an empty list, and if
you do a reverse merge from there, you still get an empty list
2. If you are merging forward, the revision(s) you are merging are added
to the revision line in sorted order (such that all revisions and
revision ranges in the list are monotonically increasing from left to
right).  The exact details of how the range is represented in terms of a
list of single revs, or a revision range, is left as a quality of
implementation detail.  The only requirement is that the range be
correct.
3. The path (known as PATHNAME in the grammar) used as the key to
determine which revision line to change is the subdirectory path being
merged from, relative to the repo root, with the repo url stripped from
it.

Thus a merge of revisions 1-9 from http://foo.bar.com/reposroot/trunk
would produce "/trunk:1-9"
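
The three "what info is put" rules can be sketched in Python as below.
Both function names are made up for illustration, and keeping the recorded
revisions as a plain sorted list of single revisions is just one of the
representations the design allows.

```python
def merge_source_key(source_url, repos_root_url):
    """Strip the repository root URL to get the PATHNAME key."""
    root = repos_root_url.rstrip("/")
    if source_url != root and not source_url.startswith(root + "/"):
        raise ValueError("source is not inside this repository")
    return "/" + source_url[len(root):].strip("/")


def record_merge(merged, first, last, reverse=False):
    """Return the sorted revision list after merging first..last.

    `merged` is the list of revisions already recorded for one PATHNAME.
    """
    revs = set(merged)
    touched = set(range(first, last + 1))
    if reverse:
        # Subtracting revisions that are not present is a no-op, so no
        # "anti-revisions" are ever written out.
        revs -= touched
    else:
        revs |= touched
    return sorted(revs)


key = merge_source_key("http://foo.bar.com/reposroot/trunk",
                       "http://foo.bar.com/reposroot")
print(key, record_merge([], 1, 9))
```

A reverse merge against an empty list, such as
record_merge([], 4, 6, reverse=True), still gives an empty list.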

cross-repo merging is a bridge we can cross if we ever get there :).


pre-reqs for this design:

1. Need to be able to set a revprop to be stored on commit
2. Need to be able to say to copy a revprop from a particular revision
and only contact the server at commit time.

3. Need to be able to have auth treat SVN_MERGE_PROPERTY revprop
differently from other revprops (either by special casing the cases
users do care about controlling, or special casing props users don't
care about controlling, etc) so that people who don't have access to the
revprops can still do history sensitive merges of directories they do
have access to.


Random questions and answers

What happens if someone commits a merge with a non-merge tracking
client?
It simply means the next time you merge, you may receive conflicts that
you would have received if you were using a non-history-sensitive
client.

Can we do without the revprop portion of this design?
Technically yes, AFAIK, but it may require more crawling and querying at
merge time.

Can we do history sensitive wc<->wc merges without contacting the server?
No. But you probably couldn't anyway, even if the revprop not being
stored locally issue were not here.

What happens if the info is not there?
The same thing that happens if the info is not there now.

What happens if a user edits merge info incorrectly?
They get the results specified by their merge info.

How does the revprop stay up to date?
We copy it from revision to revision.

What happens if a user manually edits a file and unmerges a revision (IE
not using a "reverse merge" command), but doesn't update the merge info
to match?
The merge info will believe the change has still been merged.

What happens if i svn move/rename a directory, and then merge it
somewhere?
This doesn't change history, only the future, thus we will simply add
the merge info for that directory as if it was a new directory.  We will
not do something like attempt to modify all merge info to specify the
new directory, as that would be wrong.

I don't think that only copying info on svn copy is correct; what if you
copy a dir with merge info into a dir where the dir has merge info,
won't it get the info wrong now?

No.  

Let's say you have

a/foo (merge info: /trunk:5-9)
a/branches/bar (merge info: /trunk:1-4)

If you copy a/foo into a/branches/bar, we now have

a/branches/bar (merge info: /trunk:1-4)
a/branches/bar/foo (merge info: /trunk:5-9)

This is strictly correct.  The only changes which have been merged into
a/branches/bar/foo, are still 5-9.  The only changes which have been
merged into a/branches/bar are 1-4.  No merges have been performed by
your copy, only copies have been performed.  If you perform a merge of
revisions 1-9 into bar, one would expect that the history
sensitive merge algorithm will skip revisions 5-9 for
a/branches/bar/foo, and skip revisions 1-4 for a/branches/bar.
The above information gives the algorithm the information necessary to
do this.
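
A tiny Python sketch of the skipping described above.  The helper name is
hypothetical, and the actual history sensitive merge algorithm is out of
scope for this design; this only shows that the recorded info is enough to
compute what remains to be applied.

```python
def revisions_to_apply(requested, already_merged):
    """Revisions from `requested` that are not yet recorded as merged."""
    done = set(already_merged)
    return [rev for rev in requested if rev not in done]


requested = range(1, 10)    # merging revisions 1-9 from /trunk
bar_merged = range(1, 5)    # a/branches/bar      (merge info: /trunk:1-4)
foo_merged = range(5, 10)   # a/branches/bar/foo  (merge info: /trunk:5-9)

print(revisions_to_apply(requested, bar_merged))  # [5, 6, 7, 8, 9]
print(revisions_to_apply(requested, foo_merged))  # [1, 2, 3, 4]
```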

So if you want to argue svn copy has the wrong merge info semantics,
it's not because of the above, AFAIK :)


I'm sure that even in this long document, I've forgotten some things i
did spec out.
Apologies in advance.


Footnotes:
[1] This is not going to be a full blown design for property
inheritance, nor should this design depend on such a system being
implemented.

[2] Assuming 4 byte revision numbers, and repos with revisions numbering
in the hundreds of thousands.  You could do slightly better by variable
length encoding of integers, but even that will generally be 4 bytes for
hundreds of thousands of revs.  Thus, we have strings like "102341" vs 4
byte numbers, meaning you save about 2 bytes for a 4 byte integer.
Range lists in binary would need a distinguisher from single revisions,
adding a single bit to both (meaning you'd get 31 bit integers), and
thus, would require 8 bytes per range vs 12 bytes per range.  While 30%
is normally nothing to sneeze at space wise, it's also not significantly
more efficient in time, as most of the time will not be spent parsing
revision lists, but doing something with them. The space efficiency
therefore does not seem to justify the cost you pay in not making them
easily editable.
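
The arithmetic in [2] can be checked with a few lines of Python.  The two
revision numbers are arbitrary six digit examples, and separators ("-" and
",") are left out of the ASCII count to match the 12 byte figure above;
counting them would add a byte or two to the ASCII side.

```python
import struct

rev, end = 102341, 102350

ascii_single = len(str(rev))                  # digits of one revision
ascii_range = len(str(rev)) + len(str(end))   # digits of a range
binary_single = struct.calcsize("<I")         # one 4 byte integer
binary_range = struct.calcsize("<II")         # two 4 byte integers

print(ascii_single - binary_single)           # bytes saved per revision
print(round(1 - binary_range / ascii_range, 2))  # fraction saved per range
```

This gives 2 bytes saved per single revision and roughly a third saved per
range, in line with the rough figures quoted in the footnote.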



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Merge tracking proposal

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On 4/28/06, C. Michael Pilato <cm...@collab.net> wrote:
> Greg Hudson wrote:
> > On Fri, 2006-04-28 at 09:34 -0700, Daniel Berlin wrote:
> >
> >>2. If the merge target is a non-wc-root directory, the merge info goes
> >>to the property SVN_MERGE_INFO set on the directory
> >>3. If the merge target is a wc-root directory, the merge info goes to
> >>the property SVN_MERGE_INFO set on the revprop.
> >
> > What's a wc-root directory?
>
> We've used for a long time the terminology "working copy root", which is
> essentially a working copy directory whose on-disk parent directory is not
> also its versioned paths-as-they-are-in-the-repos parent.  See the docstring
> for update_editor.c:check_wc_root().

See, that's not all that check_wc_root() looks for though.  For
example, the current working directory is always considered to be a
working copy root by that function.

-garrett


Re: Merge tracking proposal

Posted by "C. Michael Pilato" <cm...@collab.net>.

Greg Hudson wrote:
> On Fri, 2006-04-28 at 09:34 -0700, Daniel Berlin wrote:
> 
>>2. If the merge target is a non-wc-root directory, the merge info goes
>>to the property SVN_MERGE_INFO set on the directory
>>3. If the merge target is a wc-root directory, the merge info goes to
>>the property SVN_MERGE_INFO set on the revprop.
> 
> What's a wc-root directory?

We've used for a long time the terminology "working copy root", which is
essentially a working copy directory whose on-disk parent directory is not
also its versioned paths-as-they-are-in-the-repos parent.  See the docstring
for update_editor.c:check_wc_root().

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Merge tracking proposal

Posted by Daniel Berlin <db...@dberlin.org>.

On Fri, 2006-04-28 at 14:49 -0400, C. Michael Pilato wrote:
> C. Michael Pilato wrote:
> > Daniel Berlin wrote:
> > 
> >>So if you commit a/b/c and a/b/d, your wc root is a/b
> > 
> > 
> > We use terms like "longest ancestor path" for this concept.  Sorry, but "wc
> > root" is already in use elsewhere.
> 
> Oops.  That's "longest common ancestor path".  Or something like that.
> 

Oh, you mean the wc root?
:)
(sorry, couldn't resist)





Re: Merge tracking proposal

Posted by "C. Michael Pilato" <cm...@collab.net>.

C. Michael Pilato wrote:
> Daniel Berlin wrote:
> 
>>So if you commit a/b/c and a/b/d, your wc root is a/b
> 
> 
> We use terms like "longest ancestor path" for this concept.  Sorry, but "wc
> root" is already in use elsewhere.

Oops.  That's "longest common ancestor path".  Or something like that.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Merge tracking proposal

Posted by "C. Michael Pilato" <cm...@collab.net>.

Daniel Berlin wrote:
> So if you commit a/b/c and a/b/d, your wc root is a/b

We use terms like "longest ancestor path" for this concept.  Sorry, but "wc
root" is already in use elsewhere.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Merge tracking proposal

Posted by Daniel Berlin <db...@dberlin.org>.

On Fri, 2006-04-28 at 11:38 -0700, Daniel Berlin wrote:
> On Fri, 2006-04-28 at 14:32 -0400, Greg Hudson wrote:
> > On Fri, 2006-04-28 at 09:34 -0700, Daniel Berlin wrote:
> > > 2. If the merge target is a non-wc-root directory, the merge info goes
> > > to the property SVN_MERGE_INFO set on the directory
> > > 3. If the merge target is a wc-root directory, the merge info goes to
> > > the property SVN_MERGE_INFO set on the revprop.
> > 
> > What's a wc-root directory?
> 
> hehe. Apparently people aren't quite aware our wc has a notion of a root
> directory.
> 
> See fun functions like svn_wc_is_wc_root.

Yay, i fat fingered send.


So to actually *answer* the question, the wc root is the root of the
commit.

So if you commit a/b/c and a/b/d, your wc root is a/b

It may make more sense to think of it this way:

If the target gets a full, complete, merge, and has all the changes of
the merge, it goes to the revprop.
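
Read as "the root of the commit", the a/b example is the longest common
ancestor path of the commit targets.  A minimal Python sketch, using a
made-up helper name and the standard posixpath.commonpath; it says nothing
about what svn_wc_is_wc_root itself checks.

```python
import posixpath


def commit_root(targets):
    """Longest common ancestor path of the commit targets."""
    return posixpath.commonpath(targets)


print(commit_root(["a/b/c", "a/b/d"]))  # a/b
```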




Re: Merge tracking proposal

Posted by Daniel Berlin <db...@dberlin.org>.

On Fri, 2006-04-28 at 14:32 -0400, Greg Hudson wrote:
> On Fri, 2006-04-28 at 09:34 -0700, Daniel Berlin wrote:
> > 2. If the merge target is a non-wc-root directory, the merge info goes
> > to the property SVN_MERGE_INFO set on the directory
> > 3. If the merge target is a wc-root directory, the merge info goes to
> > the property SVN_MERGE_INFO set on the revprop.
> 
> What's a wc-root directory?

hehe. Apparently people aren't quite aware our wc has a notion of a root
directory.

See fun functions like svn_wc_is_wc_root.








Re: Merge tracking proposal

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2006-04-28 at 09:34 -0700, Daniel Berlin wrote:
> 2. If the merge target is a non-wc-root directory, the merge info goes
> to the property SVN_MERGE_INFO set on the directory
> 3. If the merge target is a wc-root directory, the merge info goes to
> the property SVN_MERGE_INFO set on the revprop.

What's a wc-root directory?



Re: Merge tracking proposal

Posted by "Ph. Marek" <ph...@bmlv.gv.at>.

On Sunday 30 April 2006 03:30, Giovanni Bajo wrote:
> Daniel Berlin <db...@dberlin.org> wrote:
> > A merge info property, named SVN_MERGE_PROPERTY (not the real name, I
> > have made it a constant so we can have a large bikeshed about what to
> > really call it) stored in the revision properties, directory
> > properties, and file properties.
>
> My bikeshed is to call it svnmerge-integrated. Not sure if it's possible
> given that "svn:" namespace policy, but I'm sure many users will be happy
> of that.
>
> One thing that I would like to throw in is a problem we face in svnmerge:
> if you merge r100 from branch A into B, and the merge is done in a new
> commit (r200), there is no way svnmerge can know that r100 and r200 are,
> effectively, the same patch, just committed to two different branches. This
> is of course a problem if you then merge into branch C, since you should
> keep track of the fact that, if r100 is merged, you don't need to merge
> r200.
>
> svnmerge's problem comes from the fact that it can't know that r200 is
> going to be r200 before it is actually committed, so it can't record it
> into the property somehow. I believe this should be fixed if the backend
> handles merge tracking.
IMO we might need (and should, to facilitate cross-repo merging later) to give 
every revision some kind of uuid, that can be replicated along with the data, 
and doesn't change like the revision number.
That would solve the problem of the unknown-revision-number on commit, too.
(If I mirror some repository with svk, I get completely different revision 
numbers).
Note: these uuids could be done even now, by a simple perl/python/shell script 
or whatever. Repository dump/load is not required.


But that would bring us the other mentioned problem: Huge lists of merged 
revisions.
Here it might be necessary to store *not* the list, but a kind of tree - 
"I merged all that was included in uuid A, and got revision uuid B down to 
uuid D too".
I'll try to paint that.

I'll include the uuids in (). Uppercase letters stand for the result, 
lowercase for other changes in this revision *only*.
     /trunk                /branch
       R1 (A) --copy------>  R2 (A)
doing changes ...               
       R3 (C=A+c)            R4 (D=A+d)
       R5 (E=C+e)            R6 (F=D+f)
       R7 (G=E+g)               /   
merging:                       /
       R8 (H=G+F+h)  <--merge-/
+h is needed if small fixes/conflicts have to be resolved.


The obvious problem for that is that the merger (client or server) has to 
traverse the tree (or even graph!) until a common revision (A in this case) 
is found. But that'll happen always, I believe...
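
A rough Python sketch of that traversal, using the letters from the
diagram as stand-ins for the uuids.  The parent table and both function
names are assumptions for illustration only; nothing like this exists in
Subversion.

```python
from collections import deque

# Each result uuid maps to the uuids it was built from (see the diagram).
PARENTS = {
    "A": [], "C": ["A"], "E": ["C"], "G": ["E"],
    "D": ["A"], "F": ["D"], "H": ["G", "F"],
}


def ancestors(node, parents):
    """Return `node` plus every uuid reachable through parent links."""
    seen = {node}
    queue = deque([node])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def nearest_common_ancestor(left, right, parents):
    """First uuid found breadth-first from `left` that `right` also reaches."""
    right_side = ancestors(right, parents)
    seen = {left}
    queue = deque([left])
    while queue:
        node = queue.popleft()
        if node in right_side:
            return node
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


print(nearest_common_ancestor("G", "F", PARENTS))  # A
```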


If we unmerge (reverse-merge) a revision, we'd have to log, eh, I=H-e or some 
such ...


The storing of the "calculation" saves space, but needs a kind of arithmetic 
processing before merging.



Just my €0.02.


Regards,

Phil


Re: Merge tracking proposal

Posted by Giovanni Bajo <ra...@develer.com>.

Daniel Berlin <db...@dberlin.org> wrote:

> A merge info property, named SVN_MERGE_PROPERTY (not the real name, I
> have made it a constant so we can have a large bikeshed about what to
> really call it) stored in the revision properties, directory
> properties, and file properties.

My bikeshed is to call it svnmerge-integrated. Not sure if it's possible given
the "svn:" namespace policy, but I'm sure many users will be happy with that.

One thing that I would like to throw in is a problem we face in svnmerge: if
you merge r100 from branch A into B, and the merge is done in a new commit
(r200), there is no way svnmerge can know that r100 and r200 are, effectively,
the same patch, just committed to two different branches. This is of course a
problem if you then merge into branch C, since you should keep track of the
fact that, if r100 is merged, you don't need to merge r200.

svnmerge's problem comes from the fact that it can't know that r200 is going to
be r200 before it is actually committed, so it can't record it into the
property somehow. I believe this should be fixed if the backend handles merge
tracking.

> This is chiefly because the benefit of such a canonical format
> (slightly easier *comparison*, but not indexing) is heavily
> outweighed by the fact that generating a canonical form may require
> groveling through a lot of information to determine what that minimal
> canonical form is.  In particular, it may be that the revision list
> "5,7,9" is, in minimal canonical form, "5-9", because 6 and 8 do not
> have any affect on the pathname that 5 and 9 are from.
> Canonicalization could be done as a server side post pass because the
> information is stored in properties.

Notice that this canonicalization is done by svnmerge whenever it comes for
free, which is almost always since it always has to run a svn log to see which
revisions to merge. And when it's parsing the svn log output, it's easy to
notice gaps and use them to canonicalize. If you look at svnmerge code, these
commits to other branches are called "phantom revisions".

Producing a canonicalized range has one very big benefit, when doing merge
tracking in development branch: it keeps the merge property very small and
readable. Basically, it's always something like "/trunk:1-1000", which is very
easy for the human ("this branch was last merged at r1000"). If you don't do
this, the property gets several kilobytes long after a few months (especially
if you have a large repository with many branches or - worse - many different
projects in it). When you have 10 pages worth of random numbers, it's pretty
hard to know what's going on, and people will go wtf.

In other words, if you don't do canonicalization through phantom revisions,
you're going to miss most of the advantages of using an ASCII property. You can
as well use a binary one at that point.

Giovanni Bajo


Re: Merge tracking proposal

Posted by Giovanni Bajo <ra...@develer.com>.

Daniel Berlin <db...@dberlin.org> wrote:

> The reason I said "there is simply nothing you can do about this" in
> reference to the length of the lists as the number of merges grows
> large.  No matter how much you minimize, there is always a worst case
> that will make the lists large. For example, merging every 3rd
> revision would produce lists that can't be minimized.

Sure, but let's remember that all users following the "Repeated Merge" pattern
(which, in my experience, are the majority) *will* benefit from normalization of
revision range with insertion of phantom revisions. For them, the
merge-tracking info would always be a compact range like: /trunk@123:1-4691.

As for your case of merging every 3rd revision, what about the other revisions?
If you add support for "Block/Unblock Change Set" (which I see is not in your
proposal), and people do block the other revisions, you could still show a
compact range in response to a user query (if and only if you take care of
phantom revisions at the same time).

Giovanni Bajo


Re: Merge tracking proposal

Posted by Daniel Berlin <db...@dberlin.org>.

> > However, there is simply nothing you can do about this.
> 
> On the contrary, the person who talked about 'svnmerge.py' said that it was 
> able to reduce lists like "1,2,3,4,5" to ranges like "1-5" and even do the same 
> for lists with gaps where the missing revision numbers were irrelevant to the 
> target.

We must be talking past each other.

Certainly, you can minimize the length in terms of making sure
continuous numbers are merged into ranges (and in fact, our
implementation will try), i've simply not made it *required* in the
design, because i haven't estimated the performance impact yet.  If it
turns out to be minimal performance impact, i probably will modify the
design to make it required of all implementations that touch merge
tracking info.

I said "there is simply nothing you can do about this" in
reference to the length of the lists as the number of merges grows
large.  No matter how much you minimize, there is always a worst case
that will make the lists large. For example, merging every 3rd revision
would produce lists that can't be minimized.







Re: Merge tracking proposal

Posted by Julian Foad <ju...@btopenworld.com>.

Daniel Berlin wrote:
> I will post the revised document. In the meanwhile, here are answers to
> some of the open ended questions you have asked.

Thanks; I've just seen the second revision, but haven't read it properly yet.

>>It would be informative to apply this algorithm to the merges that have already 
>>been done on Subversion's repository, to see what the result is.  For instance, 
>>that might give a reasonable indication of whether the lists of revisions are 
>>going to grow too long to be considered human-usable, as someone wondered.
> 
> I did before i started, and the list of revisions doesn't grow that long
> (It's about 20-30 revisions).

That's good to hear.

> However, there is simply nothing you can do about this.

On the contrary, the person who talked about 'svnmerge.py' said that it was 
able to reduce lists like "1,2,3,4,5" to ranges like "1-5" and even do the same 
for lists with gaps where the missing revision numbers were irrelevant to the 
target.

[...]
> What do you foresee humans doing to it that requires actually caring
> about all 100k revisions in the list?

I don't know.  Simply loading the text into an editor and finding the 
appropriate place in the line would fail or take an unreasonably long time with 
some editors that have line length limitations or assumptions.  OK, you can 
blame those editors.  Anyway, that was just one aspect to think about.

>>>The first question that many people ask is "where should we store the
>>>merge information" (what we store will be covered next).
>>
>>Well, they may ask, but it doesn't make much sense to discuss this until we 
>>know what information is to be stored.
> 
> Actually, this is wrong in this case, even if it may be true in general.
> How you store the merge info will generally dictate what you can store,
> for performance and space reasons.  Thus, it makes perfect sense to
> specify how you store it first.

This is irrelevant to the proposal itself, but it seems we misunderstood each 
other.  By "what information" I meant information in the abstract, logical 
sense, not the particular encoding of it.  I meant that you can't choose a 
suitable storage location until you know certain things about the data you want 
to store - its required lifetime, whether its size is fixed or variable, etc.

- Julian


Re: Merge tracking proposal

Posted by Daniel Berlin <db...@dberlin.org>.

I will post the revised document. In the meanwhile, here are answers to
some of the open ended questions you have asked.


> It would be informative to apply this algorithm to the merges that have already 
> been done on Subversion's repository, to see what the result is.  For instance, 
> that might give a reasonable indication of whether the lists of revisions are 
> going to grow too long to be considered human-usable, as someone wondered.

I did before I started, and the list of revisions doesn't grow that long
(it's about 20-30 revisions).

However, there is simply nothing you can do about this.
Human editing was meant for fixing up merge info after changes you have
made that affect it but that Subversion doesn't know about, not for
anything else.  This generally consists of removing a merged rev, adding
one, or splitting one.  None of these requires anything more than looking
up a specific number or range in the merge info and doing the right thing
to it, so I can't see how this would not be "human usable", even if the
list were 100k revisions long (and sorted).

What do you foresee humans doing to it that requires actually caring
about all 100k revisions in the list?
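
As a rough illustration of why such an edit only needs a local lookup, here
is a minimal sketch of removing one revision from a sorted range list,
splitting a range where necessary.  The representation (a sorted list of
inclusive (start, end) tuples) and the name "remove_rev" are assumptions
for the example, not part of the proposal.

```python
import bisect


def remove_rev(ranges, rev):
    """Remove `rev` from a sorted list of inclusive (start, end) ranges.

    Only the one range containing `rev` is touched; it is shortened or
    split in two.  The input list is not modified.
    """
    # Find the last range whose start is <= rev (binary search).
    i = bisect.bisect_right(ranges, (rev, float("inf"))) - 1
    if i < 0:
        return ranges            # rev is before every range
    start, end = ranges[i]
    if rev > end:
        return ranges            # rev falls in a gap; nothing to remove
    pieces = []
    if start < rev:
        pieces.append((start, rev - 1))
    if rev < end:
        pieces.append((rev + 1, end))
    return ranges[:i] + pieces + ranges[i + 1:]


print(remove_rev([(1, 10), (20, 30)], 5))   # [(1, 4), (6, 10), (20, 30)]
```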

> > Information storage
> > 
> > The first question that many people ask is "where should we store the
> > merge information" (what we store will be covered next).
> 
> Well, they may ask, but it doesn't make much sense to discuss this until we 
> know what information is to be stored.


Actually, this is wrong in this case, even if it may be true in general.
How you store the merge info will generally dictate what you can store,
for performance and space reasons.  Thus, it makes perfect sense to
specify how you store it first.

> I half-understood that some parents/grandparents might store copies of the 
> merge info that is on this object.  If so, and if you don't explicitly remove 
> the other copies of this info from the parent dir(s), won't obsolete history 
> build up, that is not incorrect but is annoying?

Yes, obsolete info can build up.




Re: Merge tracking proposal

Posted by Julian Foad <ju...@btopenworld.com>.

Daniel,

I've read through and come up with a list of questions as they occurred to me. 
  Some are rather open-ended; most are basically asking for more information or 
explanation.  I don't expect you to answer them all in full (e.g. providing a 
full set of use cases and examples in response to my "Use cases?" question). 
If you think about them and try to answer most of them within the next draft of 
the proposal, I'll read that and see if there is anything still unclear.

Daniel Berlin wrote:
> As part of this, I have come up with a design I plan on implementing for
> tracking what revisions have been merged where, in a manner that is
> suitable for use by various other operations (history sensitive merging,
> etc).

Scope?  That is, what are the limits of the kinds of merging this is intended 
to support?  Automatic merges client-side?  Server-side?  Manual merges (i.e. 
where no "svn merge" command was used)?  More?

> In doing so, I reviewed the use cases that were kindly written up, and
> believe that most if not all of them can be accomplished with this
> design.

Can you give references to these use cases?

Examples of how this design works in practice in these cases?

(Trunk-to-release-branch, feature-branch-to-trunk, repeated merging, vendor 
branch, undoing a change, ...)

It would be informative to apply this algorithm to the merges that have already 
been done on Subversion's repository, to see what the result is.  For instance, 
that might give a reasonable indication of whether the lists of revisions are 
going to grow too long to be considered human-usable, as someone wondered.

> Goals:
> 
> The overarching goal here is to track the revision numbers being merged
> by a merge operation, and keeping this information in the right places
> as various operations (copy, delete, add, etc) are performed.
> 
> The goals of this design are:
> 1. To be able to track this down to what files in a working copy and be
> able to determine what files have had what revisions merged into them.
> 
> 2. To not need to contact the server more than we already do now to
> determine which revisions have been merged in a file or directory (ie
> some contact is acceptable, asking the server about each file is not).
> 
> 3. To be able to edit merge information in a human editable form.
> 
> 4. For the information to be stored in a space efficient manner, and to
> be able to determine the revisions merged into a given file/directory in
> a time efficient manner.
> 
> 5. Still getting a conservatively correct answer (not worse than what we
> have now) when no merge info is specified.
> 
> 6. To be able to collect, transmit, and keep this information up to date
> as much as possible on the client side.
> 
> 7. To be able to index this information in the future in order to answer
> queries
> 
> Specific Non-goals for *this design* include:
> 1. Doing actual history sensitive merging
> 2. Curing cancer (aka being all things to all people)

> The one argument I continually have with myself is whether to store info
> in revprops, or just on dirs and files. If you want to try to
> convincingly argue one way or the other, go for it. Certainly, I think
> it makes certain semantics clearer on what operations do below and how
> to proceed easier, the question is whether it is efficient enough time
> wise when we go to retrieve merge info, and whether it complicates what
> merge has to do too much.  It also removes all of the listed
> pre-reqs :).

In this design, what purpose does the rev-prop serve?  Aren't you using it for 
just the same purposes that you would use a property on the repository root 
directory, and yet, being a rev-prop, it has completely different behaviour?  I 
don't see why you would want to do that.

(I've now seen your later comment that you're on the brink of throwing away the 
rev-prop part of this proposal.  +1 on that.)

> One could also try to argue that we should start with exactly the same
> cases svnmerge does (IE only allow merge info at the wc roots, only
> store it on that directory, etc), with a nicer integrated interface, and
> try to expand it from there. I am open to such an argument as well. :)

> 
> Information storage
> 
> The first question that many people ask is "where should we store the
> merge information" (what we store will be covered next).

Well, they may ask, but it doesn't make much sense to discuss this until we 
know what information is to be stored.

> A merge info property, named SVN_MERGE_PROPERTY (not the real name, I
> have made it a constant so we can have a large bikeshed about what to
> really call it) stored in the revision properties, directory properties,
> and file properties.
> Each will store the *full, complete* list of current merged in changes,

Complete list of what?  The merge-prop on an item (say directory /d1/d2) shall 
list all the changes that have ever been merged into this item, including 
indirectly (via merging a change that partly consisted of a previous merge), 
and including any merges to its parent (/d1) or grandparents that modified it?

Is there significant duplication of information among these lists?  (I can't 
tell yet.)  If so, that is likely to make manual editing unsafe.

> as far as it knows.  This ensures that the merge algorithm and other
> consumers do not have to walk back revisions in order to get the
> transitive closure of the revision list.

Could you expand on this?  I don't follow especially "walk back revisions" and 
"transitive closure".

> The way we choose which of file, dir, revprop merge info to use in case
> of conflicts is a simple system of inheritance[1] where the "most specific"
> place wins.  This means that if the property is set on a file, that
> completely overrides the directory and revision level properties.

> As for what is stored:

> revisionline -> PATHNAME COLON revisionlist
> 
> top -> revisionline (NEWLINE revisionline)*

Semantics?  This merge history ("top"), existing on a file, dir or repo, 
specifies all the changes that have ever been merged into this object (file, 
dir or repo) within this repository.  It specifies the sources of the merges, 
(and thus two or more pathnames may be required to represent one source object 
at different revisions due to renaming).  Is that right?
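
To make the quoted grammar concrete, here is a minimal parsing sketch.  The
rule for "revisionlist" is not quoted in this message, so the sketch
ASSUMES it is a comma-separated list of single revisions and "a-b" ranges;
the function name "parse_mergeinfo" is likewise made up for the example.

```python
def parse_mergeinfo(text):
    """Parse 'PATHNAME:revisionlist' lines into {path: sorted revisions}.

    Assumption: a revisionlist is comma-separated revisions and/or
    inclusive 'a-b' ranges, e.g. "/branch1:3-5,8".
    """
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last colon so a colon inside the path is kept.
        path, revlist = line.rsplit(":", 1)
        revs = set()
        for item in revlist.split(","):
            item = item.strip()
            if not item:
                continue  # an empty list is allowed (nothing merged)
            if "-" in item:
                lo, hi = item.split("-")
                revs.update(range(int(lo), int(hi) + 1))
            else:
                revs.add(int(item))
        info[path] = sorted(revs)
    return info


print(parse_mergeinfo("/branch1:10\n/branch2:3-5,8"))
# {'/branch1': [10], '/branch2': [3, 4, 5, 8]}
```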

What is the peg revision for PATHNAME?  Something like "rev" for each "rev" in 
the list, such that a single "revisionline" can list changes taken from more 
than one source object?

The merge history for a file is a subset of the history lines for its dir, and 
the history of the dir similarly of its immediate parent, so on upwards?  Or 
not - are intermediate dirs allowed to have no history?  How is that 
relationship maintained?

How do you handle the indirect merge situation (merging a change that contains 
a previous merge)?  Do the revision numbers of both the earlier, little merge, 
and the later, bigger merge that includes the little one, appear in the 
list(s)?  For instance,

   r10 modifies /branch1/f and /branch1/g

   r12 merges r10 from /branch1 into /branch2
     /branch2 says "/branch1:10"

   r14 merges r10 from /branch1/f into /trunk/f
     /trunk/f says "/branch1/f:10"

   r16 merges r12 from /branch2 into /trunk  (carefully avoiding repeating
     the r14 part of r10, as it's already known to be here)

What do /trunk, /trunk/f, /trunk/g say?

> svn add:  No change to merge info
> svn delete: No direct change to merge info (indirectly, because the
> props go away, so does the merge info for the file)

I half-understood that some parents/grandparents might store copies of the 
merge info that is on this object.  If so, and if you don't explicitly remove 
the other copies of this info from the parent dir(s), won't obsolete history 
build up, that is not incorrect but is annoying?

(I can see that it may be difficult or impossible to determine what info can be 
removed from the parents.)

> svn rename: No change to merge info
> svn copy: Copies the merge info from the source path to the destination
> path, if any.
> 
> This includes copying info from revprops, if necessary, by determining
> if the merge info exists in a revprop for the last changed commit for
> the source path, and copying it to the new revprop if it does (someone
> probably needs to check if this is the right semantic :P)
> 
> All copies are full-copies of the merge information.
> 
> svn merge: Adds or subtracts to the merge info, according to the
> following:
> 
> Where to put the info:
> 1. If the merge target is a single file, the merge info goes to the
> property SVN_MERGE_INFO set on that file.
> 2. If the merge target is a non-wc-root directory, the merge info goes
> to the property SVN_MERGE_INFO set on the directory
> 3. If the merge target is a wc-root directory, the merge info goes to
> the property SVN_MERGE_INFO set on the revprop.

Why the difference between wc-root and non-wc-root?  How do you determine 
whether a directory specified in a client operation is a wc-root or non-wc-root?

I saw a later message saying that by "wc-root" you meant "longest common 
ancestor path" of a commit operation, but I still don't understand.  A commit 
is not necessarily going to be done until well after the merge command and 
potentially other merges and other operations have been done in the WC.

Is this all to do with the fact that you need write access to the properties of 
some parent directory which may not be present or may not be locked for write 
access?

> What info is put:
> 1. If you are merging in reverse, revisions are subtracted from the
> revision lines, but we never write out anti-revisions.  Thus, if you
> subtract all the merged revisions, you just get an empty list, and if
> you do a reverse merge from there, you still get an empty list
> 2. If you are merging forward, the revision(s) you are merging is added
> to the revision line

These (1 and 2) seem reasonable.
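
A minimal sketch of rules 1 and 2, treating the revisions merged from one
source path as a plain set.  The set representation and the name
"record_merge" are assumptions for illustration only.

```python
def record_merge(merged, revs, reverse=False):
    """Return the updated set of merged revisions for one source path."""
    revs = set(revs)
    if reverse:
        # Set difference cannot go "negative": revisions that were never
        # recorded are ignored, so no anti-revisions are ever written.
        return merged - revs
    return merged | revs


m = record_merge(set(), [10, 11])            # forward merge of r10 and r11
m = record_merge(m, [10, 11], reverse=True)  # undo both
print(m)                                     # set()
print(record_merge(m, [10], reverse=True))   # still set()
```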

When a merge has been performed in the WC but not yet committed, and a merge 
has been committed to the repository in the meantime, how is "svn update" going 
to merge the latest repository version of the merge-history property into the 
WC version of it - (a) when the update goes smoothly, and (b) when the update 
has conflicts?

As you have had various bits of feedback already, I think it would be useful if 
you could post the latest revision of the proposal soon, regardless of how much 
it addresses my questions.

- Julian


Re: Merge tracking proposal

Posted by Garrett Rooney <ro...@electricjellyfish.net>.

On 4/28/06, Daniel Berlin <db...@dberlin.org> wrote:

> Information storage
>
> The first question that many people ask is "where should we store the
> merge information" (what we store will be covered next).
>
> After a large amount of research, the design I have come up with is
> this:
> A merge info property, named SVN_MERGE_PROPERTY (not the real name, I
> have made it a constant so we can have a large bikeshed about what to
> really call it) stored in the revision properties, directory properties,
> and file properties.
> Each will store the *full, complete* list of current merged in changes,
> as far as it knows.  This ensures that the merge algorithm and other
> consumers do not have to walk back revisions in order to get the
> transitive closure of the revision list.
>
> The way we choose which of file, dir, revprop merge info to use in case
> of conflicts is a simple system of inheritance[1] where the "most specific"
> place wins.  This means that if the property is set on a file, that
> completely overrides the directory and revision level properties.

(Some questions were asked on IRC, so I'll try to replicate the
answers here; DannyB will hopefully correct me if I've screwed them up
too badly.)

Note that the answer of which is the "most specific place" needs to be
calculated more than once, for example if you've got merge props on
both dir A and dir A/B, then stuff under dir A/B uses A/B's merge
props and other things use dir A's.  Note that this implies that later
merges to A need to update A/B's props.
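
The "most specific place wins" lookup can be sketched like this.  It is a
minimal sketch under assumptions: "props" is a hypothetical dict mapping
repository paths to their merge info, the revprop level is left out, and
the function name is made up.

```python
import posixpath


def effective_mergeinfo(path, props):
    """Return merge info from `path` or its nearest ancestor that has any.

    `props` is a hypothetical {path: merge info} mapping; None is
    returned when neither the path nor any ancestor carries merge info.
    """
    while True:
        if path in props:
            return props[path]
        parent = posixpath.dirname(path)
        if parent == path:       # reached the root without finding any
            return None
        path = parent


props = {"/A": "/branch1:10", "/A/B": "/branch1:10,12"}
print(effective_mergeinfo("/A/B/file.c", props))  # /branch1:10,12
print(effective_mergeinfo("/A/C/file.c", props))  # /branch1:10
```

Because A/B's info completely overrides A's, nothing merged into A later
is visible under A/B unless A/B's own property is updated as well, which
is the point made above.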

> The way we choose which to store to depends on how much and where you
> merge, and will be covered in the semantics.
>
> The reasoning for this system is to avoid having to either copy info
> everywhere, or crawl everywhere, in order to determine which revisions
> have been applied.  At the same time, we want to be space and time
> efficient, so we can't just store the entire revision list everywhere.

Note that for revprop storage to actually be more efficient, you need
to assume that merge calculations are being done server side, and thus
have quick access to all the revprops.  If the calculation happens
client side, dropping revprops is obviously a win.

-garrett


Re: Merge tracking proposal

Posted by Daniel Berlin <db...@dberlin.org>.

> > Random questions and answers
> > Can we do without the revprop portion of this design?
> > Technically yes, AFAIK, but it may require more crawling and querying at
> > merge time.
> 
> I am not sure why this would be true.  Revprops are not stored in the WC, 
> so won't this require a lot of repository access to retrieve the revprops? 
>  Or do you envision all of this happening on the server anyway?

Almost all of the "determining what we want to merge in" part of history
sensitive merging will have to happen server side no matter what we
do, because in no design other than keeping the repo local do clients
ever have as much info as the server.  Thus, if you wanted the
absolute best job, you'd still have to do it server side, unless we
really want to try to keep *all* the merge history for an entire repo's
history client side, and keep it in sync.

The "efficiency" part is just because in our current backend fs designs,
revprop access is much faster than random dir property access.

Of course, this may be a complete wash anyway, because you have to get a
bunch of info about those dirs *anyway*.

As I said, I seesaw back and forth about whether to deal with revprops
at all because if you don't, you get the following advantages:

1. Much easier auditability
2. Don't need to care about wc roots/commit-roots
3. Don't need to try to bring revprops from rev to rev.

There is always some server contact in merges right now, except wc-wc
merges.

The only thing putting *any* of this in the client buys you is that the
latest version of the merge history for your *target* is there (in the
form of dir/file props), and is editable.  Of course, with revprops,
that's not quite true :(

Source url merge history is always going to need to be looked up in some
manner.

I'm literally about 5 seconds away from throwing away the revprop part
of storage, because it just seems to make things significantly more
confusing.





Re: Merge tracking proposal

Posted by Mark Phippard <ma...@softlanding.com>.

NOTE:  These comments are only somewhat associated with your design.  I 
have wanted to add better integration of Subversion with Issue Tracking 
systems and thought revision properties would be a good way to store this 
information.  So these comments are based more on the usage of rev props.

Daniel Berlin <db...@dberlin.org> wrote on 04/28/2006 12:34:04 PM:

> 1. Need to be able to set a revprop to be stored on commit

You should potentially see:

http://subversion.tigris.org/issues/show_bug.cgi?id=1976

It would be nice if the general capability in that issue were added at the 
same time.

> 3. Need to be able to have auth treat SVN_MERGE_PROPERTY revprop
> differently from other revprops (either by special casing the cases
> users do care about controlling, or special casing props users don't
> care about controlling, etc) so that people who don't have access to the
> revprops can still do history sensitive merges of directories they do
> have access to.

Depending on how you implement this you might also need to take hooks into 
consideration since revprops cannot be changed without the hook enabled. I 
would assume from #1 that you would build setting the revprop into the 
commit transaction so there would be no more need to change this revprop 
than something like svn:log.

Since revprops are not stored in the WC, where would this info be stored 
so that commit would know how to include it?  Maybe commit copies the 
previous revprop and adds the new info to it?

> Random questions and answers
> Can we do without the revprop portion of this design?
> Technically yes, AFAIK, but it may require more crawling and querying at
> merge time.

I am not sure why this would be true.  Revprops are not stored in the WC, 
so won't this require a lot of repository access to retrieve the revprops? 
 Or do you envision all of this happening on the server anyway?

Thanks

Mark





