You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Paul Burba <pt...@gmail.com> on 2010/04/13 15:39:05 UTC

[RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

On Wed, Mar 31, 2010 at 10:05 AM, Paul Burba <pt...@gmail.com> wrote:
> Mike and I were discussing the changes I made in
> http://svn.apache.org/viewvc?view=revision&revision=927243 to fix
> issue #3020 and which were backported to 1.6.x.  There is a regression
> in that fix and I am changing my vote to -1 and pulling it from 1.6.x
> (and today's roll of 1.6.10).
>
> The fix in r927243 addressed the problem of mergeinfo in a partial
> dump of a repository, specifically:
>
> We dump -r(X>1):Y from repos A then load that dump into repos B.  If
> there is mergeinfo in the loaded revisions it may refer to revisions <
> X.  r927423 strips out these ranges.  This is fine if the partial dump
> of repos A is done in one step, e.g,
>
>  svnadmin dump reposA -r200:300 >  A.200.300.partial.dump
>  svnadmin load reposB < A.200.300.partial.dump
>
> because those revisions don't refer to valid history re the
> mergeinfo's merge source.
>
> Unfortunately this fix breaks a (likely much more) common use case:
> Dumping a complete repository in multiple steps and then loading each
> chunk to the new repository, e.g.:
>
>  svnadmin dump reposA -r0:100                 >  A.0.100.dump
>  svnadmin dump reposA -r101:200 --incremental >  A.101.200.dump
>  svnadmin dump reposA -r201:300 --incremental >  A.201.300.dump
>
>  svnadmin load reposB < A.0.100.dump
>  svnadmin load reposB < A.101.200.dump
>  svnadmin load reposB < A.201.300.dump
>
> In this case, valid mergeinfo may be filtered from the 2nd and or 3rd load.
>
> I'll work on a fix that can handle both use cases, but for now I am
> changing my vote to -1 and reverting this backport.
>
> Paul

After thinking about this some more I see three options:

A) Keep the pre-927243 behavior as the default and thus support the
incremental dump use case by default.  Add a new option to svnadmin
load, --skip-missing-merge-sources (or maybe
--filter-missing-merge-sources) to activate the filtering of mergeinfo
sources outside of the dump stream (i.e. make the r927243 changes only
take affect with this option).  The obvious drawback is that admins
must know to use this option.  They would still be able to partially
dump a repos then load it into an empty (or unrelated) repos and have
bogus mergeinfo.

B) If the load stream's mergeinfo contains a merge source-rev pair
that predates the start of the load stream, but exists in the target
repository, then allow it to be loaded, otherwise filter it.  The
drawbacks are two:  First, performance; we'd need to check every
path/rev pair of incoming mergeinfo which certainly isn't going to
speed up a load*.  Second, it may be mere coincidence that the
path/rev exists in the target repository at the start of the load.

C) Revert r927243 and move the fix into svnadmin load: Give an error
when doing a partial dump that contains mergeinfo with revisions that
predate the starting rev of the dump and require the use of a new
--missing-merge-source=[skip|allow] option to successfully complete
the dump (i.e. something analogous to svndumpfilter's
--skip-missing-merge-sources).

Of course there is always:

D) <Insert your brilliant idea here>

Any preferences as to what approach to take?  Or ideas for a superior fix?

I lean towards A), since B) subjects every load to the overhead of
checking for valid merge sources, even if nothing is ultimately
filtered.  If the incremental dump use case really is more common, we
are punishing those admins with nothing to show for it.  C) is ugly
because you might not get an error until the dump is well underway
(maybe hours along).  Then you have to start all over again with the
new option.

Paul

* Testing a crude patch right now with some incremental dumps of the
old S.T.O. repos to get an feel for how serious of a penalty this
might be.

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.
On Wed, Aug 4, 2010 at 4:11 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-08-03 at 14:35 -0400, Paul Burba wrote:
>> Sweeping through my TODO list turned up this thread.
>>
>> I'll commit this last patch tomorrow if there are no objections.
>
> No objections to this improvement from me.
>
> - Julian

On Wed, Aug 4, 2010 at 2:03 PM, C. Michael Pilato <cm...@collab.net> wrote:
> On 08/03/2010 02:35 PM, Paul Burba wrote:
>> Sweeping through my TODO list turned up this thread.
>>
>> I'll commit this last patch tomorrow if there are no objections.
>
> None here.

Thanks for taking a minute gents.  Committed
http://svn.apache.org/viewvc?view=revision&revision=982673

Paul

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Julian Foad <ju...@wandisco.com>.
On Tue, 2010-08-03 at 14:35 -0400, Paul Burba wrote:
> Sweeping through my TODO list turned up this thread.
> 
> I'll commit this last patch tomorrow if there are no objections.

No objections to this improvement from me.

- Julian


> I pause a day only because I recall Mike saying he was planning to
> take a look at this.
> 
> Paul
> 
> On Tue, May 11, 2010 at 4:00 PM, Paul Burba <pt...@gmail.com> wrote:
[...]
> > At any rate, after my recent work we are left with one common(?) use
> > case still broken:
> >
> > Loading a sequence of incremental dumps when the fist load in the
> > sequence is into a *non-empty* repostitory.
> >
> > This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
> > from incremental dump', specifically see '# PART 4: Load a a series of
> > incremental dumps to an non-empty repository."
> >
> > Basically the problem is this:
> >
> > 1) Given repository ReposA with X revisions in it.
> >
> >   Note: Previously this use case was also broken if ReposB
> >   was empty to start, but this now works (assuming none of
> >   the incremental dumps in step 3 were svndumpfiltered in
> >   such a way as to remove/renumber revisions).
> >
> > 2) Given repository ReposB with Y revisions in it.
> >
> > 3) Incrementally (but fully) dump ReposA
> >
> >   svnadmin dump ReposA -r0:100                 >  A.0.100.dump
> >   svnadmin dump ReposA -r101:200 --incremental >  A.101.200.dump
> >   svnadmin dump ReposA -r201:300 --incremental >  A.201.300.dump
> >   .
> >   .
> >   .
> >   svnadmin dump ReposA -rW:X     --incremental >  A.W.X.dump
> >
> > 4) Load the ReposA incremental dumps to ReposB
> >
> >   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
> >   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
> >   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
> >   .
> >   .
> >   .
> >   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump
> >
> > The problem arises when one of the incremental loads contains
> > references to mergeinfo source revisions that predate the start of the
> > dump.  For example, say A.201.300.dump contains mergeinfo referencing
> > r82.  When this is loaded into ReposB the reference should be changed
> > to r(82+Y) to reflect the fact that ReposB has Y revisions in it
> > before we ever started loading parts of ReposA.  Currently no change
> > is made and the source rev stays at r82, which is obviously incorrect.
> >
> > The attached patch fixes this problem.  It takes all mergeinfo which
> > predates the first revision in the dump stream and adjusts it by the
> > difference between the head rev in ReposB and the current dump stream
> > revision.  Yes, this is a fairly fragile solution (as is our copyfrom
> > sources mapping, which this is based on).  If unrelated commits are
> > made to ReposB between loads of the incremental dumps, then the
> > mergeinfo source revs will be wrong; but it is *always* wrong right
> > now, at least this would fix this use case.  Opinions appreciated.
> >
> > And yes, there are still other potential problems: Think incremental
> > dumps that are dumpfiltered such that some revs are dropped or
> > renumbered.  But I'm not sure we really want to solve *every* problem
> > in this space, I'm not even sure we can (without bringing the
> > performance of svnadmin load to its knees).  I think this patch brings
> > us to "good enough" with the caveat that ill-advised use of svnadmin
> > dump/load and/or svndumpfilter can result in incorrect mergeinfo.
> >
> > Paul
> >


Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/03/2010 02:35 PM, Paul Burba wrote:
> Sweeping through my TODO list turned up this thread.
> 
> I'll commit this last patch tomorrow if there are no objections.

None here.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.
Sweeping through my TODO list turned up this thread.

I'll commit this last patch tomorrow if there are no objections.

I pause a day only because I recall Mike saying he was planning to
take a look at this.

Paul

On Tue, May 11, 2010 at 4:00 PM, Paul Burba <pt...@gmail.com> wrote:
> On Tue, Apr 20, 2010 at 11:08 PM, C. Michael Pilato <cm...@collab.net> wrote:
>> Paul Burba wrote:
>>> Hi Mike and Julian,
>>>
>>> Sorry for the delayed response; been working on/thinking about other
>>> mergey stuff lately...
>>>
>>> A few inline comments and then a proposed course of action at the end.
>>>
>>> On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>>>> The question then becomes, "How do users deal with legitimate partial dumps
>>>> that will be loaded atop something other than loads from previous
>>>> incremental windows?"  I think they do this the same way they handle the
>>>> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
>>>> maybe 'svndumpfilter' grows the logic and options required to pare away
>>>> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
>>>> stream range")?
>>>
>>> I vote for svndumpfilter growing the logic to filter ranges predating
>>> the start of the dump stream in the same way it strips mergeinfo with
>>> merge sources filtered from the dump stream.  Basically this means
>>> moving the r927243 functionality to svndumpfilter.  No really sure we
>>> need a new option, seems we could overload
>>> --skip-missing-merge-sources to include this behavior.  Thoughts?
>>
>> +1
>>
>>>> Julian Foad wrote:
>>>>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>>>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>>>>
>>>>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>>>>> mergeinfo from a merge source that predates the beginning of the dump
>>>>>> window, but it's only a warning and the dump continues.
>>>
>>> Why only --incremental mode?  We should warn for both --incremental
>>> dumps *and* non-incremental dumps because in the latter case we might
>>> be specifying a dump range -rX:Y where rX>0* and there might be
>>> mergeinfo that refers to revs < X.  And while we might want to be
>>> clever about skipping the warning when no range is specified (i.e.
>>> defaulting to -r0:HEAD), even then we should warn because issue #3020
>>> might already have resulted in a r1 with mergeinfo in the starting
>>> repository.
>>
>> You're right, of course.  I was thinking "partial dump" and saying
>> "incremental".  My bad.
>>
>>>>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>>>>> fed is a sensible one.
>>>>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>>>>> warning if it refers to a non-existent source.  The admin then gets to
>>>>> figure out whether it was bad in the first place or because of his/her
>>>>> partial-dump/partial-load scenario.
>>>>>
>>>>> However, it depends how efficient the checking is.  If that would make
>>>>> the 'load' really slow, I can see that not being wanted.
>>
>> [...]
>>
>>> No unsurprisingly, this took significantly (62%) longer, 00:05:39.609.
>>
>> (My vote was for 'svnadmin load' does nothing smart, and I stand by it.)
>>
>>> Taking what you've had to say into consideration, I propose the following:
>>>
>>> 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
>>> This stops svnadmin load from automatically filtering mergeinfo
>>> referring to revisions outside of the load stream.
>>>
>>> 2) Reapply or reimplement the part of r327243 that accounts for the
>>> case where the first loaded revision in a load stream has mergeinfo,
>>> see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
>>> item #1.
>>
>> (r927243, you mean.)
>>
>>> 3) Keep the inline copy-source warnings as they exist today in
>>> svnadmin dump, but add a new end-of-process summary warning e.g.:
>>>
>>>   * Dumped revision 504.
>>>   * Dumped revision 505.
>>>   * Dumped revision 506.
>>>   * Dumped revision 507.
>>>   WARNING: Referencing data in revision 250, which is older than the oldest
>>>   WARNING: dumped revision (504).  Loading this dump into an empty repository
>>>   WARNING: will fail.
>>>   * Dumped revision 58.
>>>   * Dumped revision 59.
>>>   * Dumped revision 510.
>>>   * Dumped revision 511.
>>>   .
>>>   .
>>>   .
>>>   WARNING: The range of revisions dumped contained references to copy
>>> sources outside that range.
>>>
>>> 4) Implement inline and summary functionality analogous to 3) but for
>>> mergeinfo ranges outside of the dump stream, e.g.:
>>>
>>>   * Dumped revision 504.
>>>   * Dumped revision 505.
>>>   * Dumped revision 506.
>>>   * Dumped revision 507.
>>>   WARNING: Mergeinfo referencing revision(s) prior to revision 250,
>>> which is older
>>>   WARNING: than the oldest dumped revision (504).  Loading this dump may result
>>>   WARNING: in invalid mergeinfo.
>>>   * Dumped revision 58.
>>>   * Dumped revision 59.
>>>   * Dumped revision 510.
>>>   * Dumped revision 511.
>>>   .
>>>   .
>>>   .
>>>   WARNING: The range of revisions dumped contained references to
>>> mergeinfo outside that range.
>>>
>>> Also, as mentioned earlier, I think this warning should kick in
>>> regardless of whether --incremental is used.
>>>
>>> 5) Overload svndumpfilter's --skip-missing-merge-sources option to
>>> also remove mergeinfo that predates the starting revision of the dump
>>> stream.
>>
>> +1 all over this stuff.
>>
>>> 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
>>> in the load steam or the target repos:
>>>
>>>    A) It doesn't.
>>>
>>>    B) Only checks that the revs are valid (fast, but
>>>       can allow mergeinfo that describes invalid
>>>       mergesource:rev pairs)
>>>
>>>    C) Find an efficient way to test for valid
>>>       mergesource:rev pairs
>>
>> For now, A (or maybe B).  Then we have our Paul back.  :-)
>>
>
> Been down the rabbit hole on-and-off the last couple weeks with this.
> Did everything we discussed so far in this thread, plus a slew of new
> bugs, see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc15
> and onward.
>
> At any rate, after my recent work we are left with one common(?) use
> case still broken:
>
> Loading a sequence of incremental dumps when the fist load in the
> sequence is into a *non-empty* repostitory.
>
> This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
> from incremental dump', specifically see '# PART 4: Load a a series of
> incremental dumps to an non-empty repository."
>
> Basically the problem is this:
>
> 1) Given repository ReposA with X revisions in it.
>
>   Note: Previously this use case was also broken if ReposB
>   was empty to start, but this now works (assuming none of
>   the incremental dumps in step 3 were svndumpfiltered in
>   such a way as to remove/renumber revisions).
>
> 2) Given repository ReposB with Y revisions in it.
>
> 3) Incrementally (but fully) dump ReposA
>
>   svnadmin dump ReposA -r0:100                 >  A.0.100.dump
>   svnadmin dump ReposA -r101:200 --incremental >  A.101.200.dump
>   svnadmin dump ReposA -r201:300 --incremental >  A.201.300.dump
>   .
>   .
>   .
>   svnadmin dump ReposA -rW:X     --incremental >  A.W.X.dump
>
> 4) Load the ReposA incremental dumps to ReposB
>
>   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
>   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
>   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
>   .
>   .
>   .
>   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump
>
> The problem arises when one of the incremental loads contains
> references to mergeinfo source revisions that predate the start of the
> dump.  For example, say A.201.300.dump contains mergeinfo referencing
> r82.  When this is loaded into ReposB the reference should be changed
> to r(82+Y) to reflect the fact that ReposB has Y revisions in it
> before we ever started loading parts of ReposA.  Currently no change
> is made and the source rev stays at r82, which is obviously incorrect.
>
> The attached patch fixes this problem.  It takes all mergeinfo which
> predates the first revision in the dump stream and adjusts it by the
> difference between the head rev in ReposB and the current dump stream
> revision.  Yes, this is a fairly fragile solution (as is our copyfrom
> sources mapping, which this is based on).  If unrelated commits are
> made to ReposB between loads of the incremental dumps, then the
> mergeinfo source revs will be wrong; but it is *always* wrong right
> now, at least this would fix this use case.  Opinions appreciated.
>
> And yes, there are still other potential problems: Think incremental
> dumps that are dumpfiltered such that some revs are dropped or
> renumbered.  But I'm not sure we really want to solve *every* problem
> in this space, I'm not even sure we can (without bringing the
> performance of svnadmin load to its knees).  I think this patch brings
> us to "good enough" with the caveat that ill-advised use of svnadmin
> dump/load and/or svndumpfilter can result in incorrect mergeinfo.
>
> Paul
>

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.
On Tue, Apr 20, 2010 at 11:08 PM, C. Michael Pilato <cm...@collab.net> wrote:
> Paul Burba wrote:
>> Hi Mike and Julian,
>>
>> Sorry for the delayed response; been working on/thinking about other
>> mergey stuff lately...
>>
>> A few inline comments and then a proposed course of action at the end.
>>
>> On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>>> The question then becomes, "How do users deal with legitimate partial dumps
>>> that will be loaded atop something other than loads from previous
>>> incremental windows?"  I think they do this the same way they handle the
>>> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
>>> maybe 'svndumpfilter' grows the logic and options required to pare away
>>> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
>>> stream range")?
>>
>> I vote for svndumpfilter growing the logic to filter ranges predating
>> the start of the dump stream in the same way it strips mergeinfo with
>> merge sources filtered from the dump stream.  Basically this means
>> moving the r927243 functionality to svndumpfilter.  No really sure we
>> need a new option, seems we could overload
>> --skip-missing-merge-sources to include this behavior.  Thoughts?
>
> +1
>
>>> Julian Foad wrote:
>>>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>>>
>>>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>>>> mergeinfo from a merge source that predates the beginning of the dump
>>>>> window, but it's only a warning and the dump continues.
>>
>> Why only --incremental mode?  We should warn for both --incremental
>> dumps *and* non-incremental dumps because in the latter case we might
>> be specifying a dump range -rX:Y where rX>0* and there might be
>> mergeinfo that refers to revs < X.  And while we might want to be
>> clever about skipping the warning when no range is specified (i.e.
>> defaulting to -r0:HEAD), even then we should warn because issue #3020
>> might already have resulted in a r1 with mergeinfo in the starting
>> repository.
>
> You're right, of course.  I was thinking "partial dump" and saying
> "incremental".  My bad.
>
>>>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>>>> fed is a sensible one.
>>>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>>>> warning if it refers to a non-existent source.  The admin then gets to
>>>> figure out whether it was bad in the first place or because of his/her
>>>> partial-dump/partial-load scenario.
>>>>
>>>> However, it depends how efficient the checking is.  If that would make
>>>> the 'load' really slow, I can see that not being wanted.
>
> [...]
>
>> No unsurprisingly, this took significantly (62%) longer, 00:05:39.609.
>
> (My vote was for 'svnadmin load' does nothing smart, and I stand by it.)
>
>> Taking what you've had to say into consideration, I propose the following:
>>
>> 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
>> This stops svnadmin load from automatically filtering mergeinfo
>> referring to revisions outside of the load stream.
>>
>> 2) Reapply or reimplement the part of r327243 that accounts for the
>> case where the first loaded revision in a load stream has mergeinfo,
>> see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
>> item #1.
>
> (r927243, you mean.)
>
>> 3) Keep the inline copy-source warnings as they exist today in
>> svnadmin dump, but add a new end-of-process summary warning e.g.:
>>
>>   * Dumped revision 504.
>>   * Dumped revision 505.
>>   * Dumped revision 506.
>>   * Dumped revision 507.
>>   WARNING: Referencing data in revision 250, which is older than the oldest
>>   WARNING: dumped revision (504).  Loading this dump into an empty repository
>>   WARNING: will fail.
>>   * Dumped revision 58.
>>   * Dumped revision 59.
>>   * Dumped revision 510.
>>   * Dumped revision 511.
>>   .
>>   .
>>   .
>>   WARNING: The range of revisions dumped contained references to copy
>> sources outside that range.
>>
>> 4) Implement inline and summary functionality analogous to 3) but for
>> mergeinfo ranges outside of the dump stream, e.g.:
>>
>>   * Dumped revision 504.
>>   * Dumped revision 505.
>>   * Dumped revision 506.
>>   * Dumped revision 507.
>>   WARNING: Mergeinfo referencing revision(s) prior to revision 250,
>> which is older
>>   WARNING: than the oldest dumped revision (504).  Loading this dump may result
>>   WARNING: in invalid mergeinfo.
>>   * Dumped revision 58.
>>   * Dumped revision 59.
>>   * Dumped revision 510.
>>   * Dumped revision 511.
>>   .
>>   .
>>   .
>>   WARNING: The range of revisions dumped contained references to
>> mergeinfo outside that range.
>>
>> Also, as mentioned earlier, I think this warning should kick in
>> regardless of whether --incremental is used.
>>
>> 5) Overload svndumpfilter's --skip-missing-merge-sources option to
>> also remove mergeinfo that predates the starting revision of the dump
>> stream.
>
> +1 all over this stuff.
>
>> 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
>> in the load steam or the target repos:
>>
>>    A) It doesn't.
>>
>>    B) Only checks that the revs are valid (fast, but
>>       can allow mergeinfo that describes invalid
>>       mergesource:rev pairs)
>>
>>    C) Find an efficient way to test for valid
>>       mergesource:rev pairs
>
> For now, A (or maybe B).  Then we have our Paul back.  :-)
>

Been down the rabbit hole on-and-off the last couple weeks with this.
Did everything we discussed so far in this thread, plus a slew of new
bugs, see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc15
and onward.

At any rate, after my recent work we are left with one common(?) use
case still broken:

Loading a sequence of incremental dumps when the fist load in the
sequence is into a *non-empty* repostitory.

This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
from incremental dump', specifically see '# PART 4: Load a a series of
incremental dumps to an non-empty repository."

Basically the problem is this:

1) Given repository ReposA with X revisions in it.

   Note: Previously this use case was also broken if ReposB
   was empty to start, but this now works (assuming none of
   the incremental dumps in step 3 were svndumpfiltered in
   such a way as to remove/renumber revisions).

2) Given repository ReposB with Y revisions in it.

3) Incrementally (but fully) dump ReposA

   svnadmin dump ReposA -r0:100                 >  A.0.100.dump
   svnadmin dump ReposA -r101:200 --incremental >  A.101.200.dump
   svnadmin dump ReposA -r201:300 --incremental >  A.201.300.dump
   .
   .
   .
   svnadmin dump ReposA -rW:X     --incremental >  A.W.X.dump

4) Load the ReposA incremental dumps to ReposB

   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
   .
   .
   .
   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump

The problem arises when one of the incremental loads contains
references to mergeinfo source revisions that predate the start of the
dump.  For example, say A.201.300.dump contains mergeinfo referencing
r82.  When this is loaded into ReposB the reference should be changed
to r(82+Y) to reflect the fact that ReposB has Y revisions in it
before we ever started loading parts of ReposA.  Currently no change
is made and the source rev stays at r82, which is obviously incorrect.

The attached patch fixes this problem.  It takes all mergeinfo which
predates the first revision in the dump stream and adjusts it by the
difference between the head rev in ReposB and the current dump stream
revision.  Yes, this is a fairly fragile solution (as is our copyfrom
sources mapping, which this is based on).  If unrelated commits are
made to ReposB between loads of the incremental dumps, then the
mergeinfo source revs will be wrong; but it is *always* wrong right
now, at least this would fix this use case.  Opinions appreciated.

And yes, there are still other potential problems: Think incremental
dumps that are dumpfiltered such that some revs are dropped or
renumbered.  But I'm not sure we really want to solve *every* problem
in this space, I'm not even sure we can (without bringing the
performance of svnadmin load to its knees).  I think this patch brings
us to "good enough" with the caveat that ill-advised use of svnadmin
dump/load and/or svndumpfilter can result in incorrect mergeinfo.

Paul

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by "C. Michael Pilato" <cm...@collab.net>.
Paul Burba wrote:
> Hi Mike and Julian,
> 
> Sorry for the delayed response; been working on/thinking about other
> mergey stuff lately...
> 
> A few inline comments and then a proposed course of action at the end.
> 
> On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>> The question then becomes, "How do users deal with legitimate partial dumps
>> that will be loaded atop something other than loads from previous
>> incremental windows?"  I think they do this the same way they handle the
>> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
>> maybe 'svndumpfilter' grows the logic and options required to pare away
>> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
>> stream range")?
> 
> I vote for svndumpfilter growing the logic to filter ranges predating
> the start of the dump stream in the same way it strips mergeinfo with
> merge sources filtered from the dump stream.  Basically this means
> moving the r927243 functionality to svndumpfilter.  No really sure we
> need a new option, seems we could overload
> --skip-missing-merge-sources to include this behavior.  Thoughts?

+1

>> Julian Foad wrote:
>>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>>
>>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>>> mergeinfo from a merge source that predates the beginning of the dump
>>>> window, but it's only a warning and the dump continues.
> 
> Why only --incremental mode?  We should warn for both --incremental
> dumps *and* non-incremental dumps because in the latter case we might
> be specifying a dump range -rX:Y where rX>0* and there might be
> mergeinfo that refers to revs < X.  And while we might want to be
> clever about skipping the warning when no range is specified (i.e.
> defaulting to -r0:HEAD), even then we should warn because issue #3020
> might already have resulted in a r1 with mergeinfo in the starting
> repository.

You're right, of course.  I was thinking "partial dump" and saying
"incremental".  My bad.

>>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>>> fed is a sensible one.
>>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>>> warning if it refers to a non-existent source.  The admin then gets to
>>> figure out whether it was bad in the first place or because of his/her
>>> partial-dump/partial-load scenario.
>>>
>>> However, it depends how efficient the checking is.  If that would make
>>> the 'load' really slow, I can see that not being wanted.

[...]

> No unsurprisingly, this took significantly (62%) longer, 00:05:39.609.

(My vote was for 'svnadmin load' does nothing smart, and I stand by it.)

> Taking what you've had to say into consideration, I propose the following:
> 
> 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
> This stops svnadmin load from automatically filtering mergeinfo
> referring to revisions outside of the load stream.
> 
> 2) Reapply or reimplement the part of r327243 that accounts for the
> case where the first loaded revision in a load stream has mergeinfo,
> see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
> item #1.

(r927243, you mean.)

> 3) Keep the inline copy-source warnings as they exist today in
> svnadmin dump, but add a new end-of-process summary warning e.g.:
> 
>   * Dumped revision 504.
>   * Dumped revision 505.
>   * Dumped revision 506.
>   * Dumped revision 507.
>   WARNING: Referencing data in revision 250, which is older than the oldest
>   WARNING: dumped revision (504).  Loading this dump into an empty repository
>   WARNING: will fail.
>   * Dumped revision 58.
>   * Dumped revision 59.
>   * Dumped revision 510.
>   * Dumped revision 511.
>   .
>   .
>   .
>   WARNING: The range of revisions dumped contained references to copy
> sources outside that range.
> 
> 4) Implement inline and summary functionality analogous to 3) but for
> mergeinfo ranges outside of the dump stream, e.g.:
> 
>   * Dumped revision 504.
>   * Dumped revision 505.
>   * Dumped revision 506.
>   * Dumped revision 507.
>   WARNING: Mergeinfo referencing revision(s) prior to revision 250,
> which is older
>   WARNING: than the oldest dumped revision (504).  Loading this dump may result
>   WARNING: in invalid mergeinfo.
>   * Dumped revision 58.
>   * Dumped revision 59.
>   * Dumped revision 510.
>   * Dumped revision 511.
>   .
>   .
>   .
>   WARNING: The range of revisions dumped contained references to
> mergeinfo outside that range.
> 
> Also, as mentioned earlier, I think this warning should kick in
> regardless of whether --incremental is used.
> 
> 5) Overload svndumpfilter's --skip-missing-merge-sources option to
> also remove mergeinfo that predates the starting revision of the dump
> stream.

+1 all over this stuff.

> 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
> in the load steam or the target repos:
> 
>    A) It doesn't.
> 
>    B) Only checks that the revs are valid (fast, but
>       can allow mergeinfo that describes invalid
>       mergesource:rev pairs)
> 
>    C) Find an efficient way to test for valid
>       mergesource:rev pairs

For now, A (or maybe B).  Then we have our Paul back.  :-)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.
Hi Mike and Julian,

Sorry for the delayed response; been working on/thinking about other
mergey stuff lately...

A few inline comments and then a proposed course of action at the end.

On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>
> The question then becomes, "How do users deal with legitimate partial dumps
> that will be loaded atop something other than loads from previous
> incremental windows?"  I think they do this the same way they handle the
> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
> maybe 'svndumpfilter' grows the logic and options required to pare away
> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
> stream range")?

I vote for svndumpfilter growing the logic to filter ranges predating
the start of the dump stream in the same way it strips mergeinfo with
merge sources filtered from the dump stream.  Basically this means
moving the r927243 functionality to svndumpfilter.  No really sure we
need a new option, seems we could overload
--skip-missing-merge-sources to include this behavior.  Thoughts?

> Julian Foad wrote:
>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>
>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>> mergeinfo from a merge source that predates the beginning of the dump
>>> window, but it's only a warning and the dump continues.

Why only --incremental mode?  We should warn for both --incremental
dumps *and* non-incremental dumps because in the latter case we might
be specifying a dump range -rX:Y where rX>0* and there might be
mergeinfo that refers to revs < X.  And while we might want to be
clever about skipping the warning when no range is specified (i.e.
defaulting to -r0:HEAD), even then we should warn because issue #3020
might already have resulted in a r1 with mergeinfo in the starting
repository.

* Admittedly we could probably choose some other minimum rev, maybe r3
(which is the minimum numbers of revisions I can think to create
mergeinfo; r1 - populate the repos, r2 - create a branch and make some
changes on the branch parent, r3 - merge r2 to the branch.

>> +1 on that, both for consistency with copy-source info and because it's
>> a good policy anyway.
>>
>> A detail: I haven't tried it, but would want it to print this warning
>> (and the copy-source one) at the end of the dump, and not just in the
>> middle of the long list of "Dumping rX" messages.
>
> Currently the warnings are all "inline".  But I agree that it would be
> beneficial to also just track boolean flags (had_suspect_copies,
> had_suspect_mergeinfo) throughout the dump and repeat warnings at the end:
>
> if had_suspect_copies:
>  print "WARNING: The range of revisions dumped contained references "
>        "to copy sources outside that range."
> if had_suspect_mergeinfo:
>  print "WARNING: The range of revisions dumped contained references "
>        "to mergeinfo outside that range."
>
>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>> fed is a sensible one.
>>
>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>> warning if it refers to a non-existent source.  The admin then gets to
>> figure out whether it was bad in the first place or because of his/her
>> partial-dump/partial-load scenario.
>>
>> However, it depends how efficient the checking is.  If that would make
>> the 'load' really slow, I can see that not being wanted.

I created a test repository with 1000 text changes, then 1000 merges
of each change, one at a time, to a branch.  Then I dumped
(non-incrementally) the first 1000 revs to a dump file and the second
1000 revs to a second dump file.  I loaded the first dump file into
two empty repositories.  Then I used trunk build of svnadmin to load
the second dump file into the first repos.  This took 00:03:28.890.
Then I used svnadmin with the attached patch to do the same with the
second repository.  The patch uses svn_fs_check_path() to check if
each merge source-revision pair in the load stream which predates the
first rev of the load stream, already exists in the target repository
(which in this example they all do, there will be no filtering).  No
unsurprisingly, this took significantly (62%) longer, 00:05:39.609.

This patch is fairly inefficient since it checks the same path-rev
pairs multiple times.  E.g.: We load a dump file with orginignal
revisions start at r50.  At r100 we encounter our first mergeinfo,
'/trunk:r50' on some path.  We check that trunk@50 exists, it does,
and we allow the mergeinfo.  Then we load r101, the mergeinfo on the
same path is now '/trunk:50-51'.  We check that trunk@50 and trunk@51
both exist...and so on.  We need to leverage that first check of
trunk@50 and not make it again (and again and again).  Keep a
mergeinfo catalog of checked paths?  That would work, but memory use
might be a problem with big repositories...

...So it can be made faster, but before I go further with it, does
svn_fs_check_path() seem the right way to go?  I'm no FS guru, so
maybe there is a better way that is not obvious to me.

> Agreed.
>
> --
> C. Michael Pilato <cm...@collab.net>
> CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Taking what you've had to say into consideration, I propose the following:

1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
This stops svnadmin load from automatically filtering mergeinfo
referring to revisions outside of the load stream.

2) Reapply or reimplement the part of r327243 that accounts for the
case where the first loaded revision in a load stream has mergeinfo,
see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
item #1.

3) Keep the inline copy-source warnings as they exist today in
svnadmin dump, but add a new end-of-process summary warning e.g.:

  * Dumped revision 504.
  * Dumped revision 505.
  * Dumped revision 506.
  * Dumped revision 507.
  WARNING: Referencing data in revision 250, which is older than the oldest
  WARNING: dumped revision (504).  Loading this dump into an empty repository
  WARNING: will fail.
  * Dumped revision 58.
  * Dumped revision 59.
  * Dumped revision 510.
  * Dumped revision 511.
  .
  .
  .
  WARNING: The range of revisions dumped contained references to copy
sources outside that range.

4) Implement inline and summary functionality analogous to 3) but for
mergeinfo ranges outside of the dump stream, e.g.:

  * Dumped revision 504.
  * Dumped revision 505.
  * Dumped revision 506.
  * Dumped revision 507.
  WARNING: Mergeinfo referencing revision(s) prior to revision 250,
which is older
  WARNING: than the oldest dumped revision (504).  Loading this dump may result
  WARNING: in invalid mergeinfo.
  * Dumped revision 58.
  * Dumped revision 59.
  * Dumped revision 510.
  * Dumped revision 511.
  .
  .
  .
  WARNING: The range of revisions dumped contained references to
mergeinfo outside that range.

Also, as mentioned earlier, I think this warning should kick in
regardless of whether --incremental is used.

5) Overload svndumpfilter's --skip-missing-merge-sources option to
also remove mergeinfo that predates the starting revision of the dump
stream.

6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
in the load steam or the target repos:

   A) It doesn't.

   B) Only checks that the revs are valid (fast, but
      can allow mergeinfo that describes invalid
      mergesource:rev pairs)

   C) Find an efficient way to test for valid
      mergesource:rev pairs

Paul

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.
Hi Mike and Julian,

Sorry for the delayed response; been working on/thinking about other
mergey stuff lately...

A few inline comments and then a proposed course of action at the end.

On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>
> The question then becomes, "How do users deal with legitimate partial dumps
> that will be loaded atop something other than loads from previous
> incremental windows?"  I think they do this the same way they handle the
> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
> maybe 'svndumpfilter' grows the logic and options required to pare away
> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
> stream range")?

I vote for svndumpfilter growing the logic to filter ranges predating
the start of the dump stream in the same way it strips mergeinfo with
merge sources filtered from the dump stream.  Basically this means
moving the r927243 functionality to svndumpfilter.  No really sure we
need a new option, seems we could overload
--skip-missing-merge-sources to include this behavior.  Thoughts?

> Julian Foad wrote:
>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>
>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>> mergeinfo from a merge source that predates the beginning of the dump
>>> window, but it's only a warning and the dump continues.

Why only --incremental mode?  We should warn for both --incremental
dumps *and* non-incremental dumps because in the latter case we might
be specifying a dump range -rX:Y where rX>0* and there might be
mergeinfo that refers to revs < X.  And while we might want to be
clever about skipping the warning when no range is specified (i.e.
defaulting to -r0:HEAD), even then we should warn because issue #3020
might already have resulted in a r1 with mergeinfo in the starting
repository.

* Admittedly we could probably choose some other minimum rev, maybe r3
(which is the minimum numbers of revisions I can think to create
mergeinfo; r1 - populate the repos, r2 - create a branch and make some
changes on the branch parent, r3 - merge r2 to the branch.

>> +1 on that, both for consistency with copy-source info and because it's
>> a good policy anyway.
>>
>> A detail: I haven't tried it, but would want it to print this warning
>> (and the copy-source one) at the end of the dump, and not just in the
>> middle of the long list of "Dumping rX" messages.
>
> Currently the warnings are all "inline".  But I agree that it would be
> beneficial to also just track boolean flags (had_suspect_copies,
> had_suspect_mergeinfo) throughout the dump and repeat warnings at the end:
>
> if had_suspect_copies:
>  print "WARNING: The range of revisions dumped contained references "
>        "to copy sources outside that range."
> if had_suspect_mergeinfo:
>  print "WARNING: The range of revisions dumped contained references "
>        "to mergeinfo outside that range."
>
>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>> fed is a sensible one.
>>
>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>> warning if it refers to a non-existent source.  The admin then gets to
>> figure out whether it was bad in the first place or because of his/her
>> partial-dump/partial-load scenario.
>>
>> However, it depends how efficient the checking is.  If that would make
>> the 'load' really slow, I can see that not being wanted.

I created a test repository with 1000 text changes, then 1000 merges
of each change, one at a time, to a branch.  Then I dumped
(non-incrementally) the first 1000 revs to a dump file and the second
1000 revs to a second dump file.  I loaded the first dump file into
two empty repositories.  Then I used trunk build of svnadmin to load
the second dump file into the first repos.  This took 00:03:28.890.
Then I used svnadmin with the attached patch to do the same with the
second repository.  The patch uses svn_fs_check_path() to check if
each merge source-revision pair in the load stream which predates the
first rev of the load stream, already exists in the target repository
(which in this example they all do, there will be no filtering).  No
unsurprisingly, this took significantly (62%) longer, 00:05:39.609.

This patch is fairly inefficient since it checks the same path-rev
pairs multiple times.  E.g.: We load a dump file with orginignal
revisions start at r50.  At r100 we encounter our first mergeinfo,
'/trunk:r50' on some path.  We check that trunk@50 exists, it does,
and we allow the mergeinfo.  Then we load r101, the mergeinfo on the
same path is now '/trunk:50-51'.  We check that trunk@50 and trunk@51
both exist...and so on.  We need to leverage that first check of
trunk@50 and not make it again (and again and again).  Keep a
mergeinfo catalog of checked paths?  That would work, but memory use
might be a problem with big repositories...

...So it can be made faster, but before I go further with it, does
svn_fs_check_path() seem the right way to go?  I'm no FS guru, so
maybe there is a better way that is not obvious to me.

> Agreed.
>
> --
> C. Michael Pilato <cm...@collab.net>
> CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Taking what you've had to say into consideration, I propose the following:

1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
This stops svnadmin load from automatically filtering mergeinfo
referring to revisions outside of the load stream.

2) Reapply or reimplement the part of r327243 that accounts for the
case where the first loaded revision in a load stream has mergeinfo,
see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
item #1.

3) Keep the inline copy-source warnings as they exist today in
svnadmin dump, but add a new end-of-process summary warning e.g.:

  * Dumped revision 504.
  * Dumped revision 505.
  * Dumped revision 506.
  * Dumped revision 507.
  WARNING: Referencing data in revision 250, which is older than the oldest
  WARNING: dumped revision (504).  Loading this dump into an empty repository
  WARNING: will fail.
  * Dumped revision 58.
  * Dumped revision 59.
  * Dumped revision 510.
  * Dumped revision 511.
  .
  .
  .
  WARNING: The range of revisions dumped contained references to copy
sources outside that range.

4) Implement inline and summary functionality analogous to 3) but for
mergeinfo ranges outside of the dump stream, e.g.:

  * Dumped revision 504.
  * Dumped revision 505.
  * Dumped revision 506.
  * Dumped revision 507.
  WARNING: Mergeinfo referencing revision(s) prior to revision 250,
which is older
  WARNING: than the oldest dumped revision (504).  Loading this dump may result
  WARNING: in invalid mergeinfo.
  * Dumped revision 58.
  * Dumped revision 59.
  * Dumped revision 510.
  * Dumped revision 511.
  .
  .
  .
  WARNING: The range of revisions dumped contained references to
mergeinfo outside that range.

Also, as mentioned earlier, I think this warning should kick in
regardless of whether --incremental is used.

5) Overload svndumpfilter's --skip-missing-merge-sources option to
also remove mergeinfo that predates the starting revision of the dump
stream.

6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
in the load steam or the target repos:

   A) It doesn't.

   B) Only checks that the revs are valid (fast, but
      can allow mergeinfo that describes invalid
      mergesource:rev pairs)

   C) Find an efficient way to test for valid
      mergesource:rev pairs

Paul

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by "C. Michael Pilato" <cm...@collab.net>.
Julian Foad wrote:
> On Wed, 2010-04-14, C. Michael Pilato wrote:
>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>
>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>> mergeinfo from a merge source that predates the beginning of the dump
>> window, but it's only a warning and the dump continues.
> 
> +1 on that, both for consistency with copy-source info and because it's
> a good policy anyway.
> 
> A detail: I haven't tried it, but would want it to print this warning
> (and the copy-source one) at the end of the dump, and not just in the
> middle of the long list of "Dumping rX" messages.

Currently the warnings are all "inline".  But I agree that it would be
beneficial to also just track boolean flags (had_suspect_copies,
had_suspect_mergeinfo) throughout the dump and repeat warnings at the end:

if had_suspect_copies:
  print "WARNING: The range of revisions dumped contained references "
        "to copy sources outside that range."
if had_suspect_mergeinfo:
  print "WARNING: The range of revisions dumped contained references "
        "to mergeinfo outside that range."

>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>> fed is a sensible one.
> 
> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
> warning if it refers to a non-existent source.  The admin then gets to
> figure out whether it was bad in the first place or because of his/her
> partial-dump/partial-load scenario.
> 
> However, it depends how efficient the checking is.  If that would make
> the 'load' really slow, I can see that not being wanted.

Agreed.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Julian Foad <ju...@wandisco.com>.
On Wed, 2010-04-14, C. Michael Pilato wrote:
> Paul Burba wrote:
> > After thinking about this some more I see three options:
> > 
> > A) Keep the pre-927243 behavior as the default and thus support the
> > incremental dump use case by default.  Add a new option to svnadmin
> > load, --skip-missing-merge-sources (or maybe
> > --filter-missing-merge-sources) to activate the filtering of mergeinfo
> > sources outside of the dump stream (i.e. make the r927243 changes only
> > take affect with this option).  The obvious drawback is that admins
> > must know to use this option.  They would still be able to partially
> > dump a repos then load it into an empty (or unrelated) repos and have
> > bogus mergeinfo.
> > 
> > B) If the load stream's mergeinfo contains a merge source-rev pair
> > that predates the start of the load stream, but exists in the target
> > repository, then allow it to be loaded, otherwise filter it.  The
> > drawbacks are two:  First, performance; we'd need to check every
> > path/rev pair of incoming mergeinfo which certainly isn't going to
> > speed up a load*.  Second, it may be mere coincidence that the
> > path/rev exists in the target repository at the start of the load.
> > 
> > C) Revert r927243 and move the fix into svnadmin load: Give an error
> > when doing a partial dump that contains mergeinfo with revisions that
> > predate the starting rev of the dump and require the use of a new
> > --missing-merge-source=[skip|allow] option to successfully complete
> > the dump (i.e. something analogous to svndumpfilter's
> > --skip-missing-merge-sources).
> > 
> > Of course there is always:
> > 
> > D) <Insert your brilliant idea here>
> 
> [Thinking aloud here.]
> 
> We have analogous behavior today with the handling of copied paths.  There
> are two aspects of this handling which might serve as models in this new
> scenario:
> 
> 1.  'svnadmin dump' warns when it is in incremental mode and must generate a
> copy action from a source that predates the beginning of the dump window,
> but it's only a warning and the dump continues.

> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
> fed is a sensible one.  If it gets a request to copy some path/rev that
> doesn't exist, stuff errors out and the repository is left in whatever state
> it was in prior to the revision that failed to load.
> 
> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
> 
> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
> mergeinfo from a merge source that predates the beginning of the dump
> window, but it's only a warning and the dump continues.

+1 on that, both for consistency with copy-source info and because it's
a good policy anyway.

A detail: I haven't tried it, but would want it to print this warning
(and the copy-source one) at the end of the dump, and not just in the
middle of the long list of "Dumping rX" messages.

> Now, there's some question here about how to handle part (2).  In the copied
> path case, 'svnadmin load' can be stupid because the failure case is a noisy
> one -- stuff errors out from deep within the FS layer.  That's not quite the
> same failure case for mergeinfo, where bad mergeinfo would be silently
> recorded without error.  So we *could* try to validate the mergeinfo somehow
> against the history recorded in the target repository.  But there follows
> another complication:  it's really not so hard to get bad (if perhaps
> innocuous) mergeinfo into a repository by normal means, and you certainly
> wouldn't expect that to cost you the ability to dump and load your repository!
> 
> So I'm really left with this:
> 
> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
> fed is a sensible one.

Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
warning if it refers to a non-existent source.  The admin then gets to
figure out whether it was bad in the first place or because of his/her
partial-dump/partial-load scenario.

However, it depends how efficient the checking is.  If that would make
the 'load' really slow, I can see that not being wanted.

- Julian


> As a net change against our shipping code, that's really just, what, the
> addition of warnings in 'svnadmin dump'?
> 
> The question then becomes, "How do users deal with legitimate partial dumps
> that will be loaded atop something other than loads from previous
> incremental windows?"  I think they do this the same way they handle the
> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
> maybe 'svndumpfilter' grows the logic and options required to pare away
> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
> stream range")?
> 


Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by "C. Michael Pilato" <cm...@collab.net>.
Paul Burba wrote:
> After thinking about this some more I see three options:
> 
> A) Keep the pre-927243 behavior as the default and thus support the
> incremental dump use case by default.  Add a new option to svnadmin
> load, --skip-missing-merge-sources (or maybe
> --filter-missing-merge-sources) to activate the filtering of mergeinfo
> sources outside of the dump stream (i.e. make the r927243 changes only
> take affect with this option).  The obvious drawback is that admins
> must know to use this option.  They would still be able to partially
> dump a repos then load it into an empty (or unrelated) repos and have
> bogus mergeinfo.
> 
> B) If the load stream's mergeinfo contains a merge source-rev pair
> that predates the start of the load stream, but exists in the target
> repository, then allow it to be loaded, otherwise filter it.  The
> drawbacks are two:  First, performance; we'd need to check every
> path/rev pair of incoming mergeinfo which certainly isn't going to
> speed up a load*.  Second, it may be mere coincidence that the
> path/rev exists in the target repository at the start of the load.
> 
> C) Revert r927243 and move the fix into svnadmin load: Give an error
> when doing a partial dump that contains mergeinfo with revisions that
> predate the starting rev of the dump and require the use of a new
> --missing-merge-source=[skip|allow] option to successfully complete
> the dump (i.e. something analogous to svndumpfilter's
> --skip-missing-merge-sources).
> 
> Of course there is always:
> 
> D) <Insert your brilliant idea here>

[Thinking aloud here.]

We have analogous behavior today with the handling of copied paths.  There
are two aspects of this handling which might serve as models in this new
scenario:

1.  'svnadmin dump' warns when it is in incremental mode and must generate a
copy action from a source that predates the beginning of the dump window,
but it's only a warning and the dump continues.

2.  'svnadmin load' does nothing smart, trusting that the dump it's being
fed is a sensible one.  If it gets a request to copy some path/rev that
doesn't exist, stuff errors out and the repository is left in whatever state
it was in prior to the revision that failed to load.

Assuming similar behavior for mergeinfo handling, we have, at a minimum:

1.  'svnadmin dump' warns when it is in incremental mode and must generate
mergeinfo from a merge source that predates the beginning of the dump
window, but it's only a warning and the dump continues.

Now, there's some question here about how to handle part (2).  In the copied
path case, 'svnadmin load' can be stupid because the failure case is a noisy
one -- stuff errors out from deep within the FS layer.  That's not quite the
same failure case for mergeinfo, where bad mergeinfo would be silently
recorded without error.  So we *could* try to validate the mergeinfo somehow
against the history recorded in the target repository.  But there follows
another complication:  it's really not so hard to get bad (if perhaps
innocuous) mergeinfo into a repository by normal means, and you certainly
wouldn't expect that to cost you the ability to dump and load your repository!

So I'm really left with this:

2.  'svnadmin load' does nothing smart, trusting that the dump it's being
fed is a sensible one.

As a net change against our shipping code, that's really just, what, the
addition of warnings in 'svnadmin dump'?

The question then becomes, "How do users deal with legitimate partial dumps
that will be loaded atop something other than loads from previous
incremental windows?"  I think they do this the same way they handle the
copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
maybe 'svndumpfilter' grows the logic and options required to pare away
mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
stream range")?

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand