You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Paul Burba <pt...@gmail.com> on 2010/05/11 20:00:56 UTC

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

On Tue, Apr 20, 2010 at 11:08 PM, C. Michael Pilato <cm...@collab.net> wrote:
> Paul Burba wrote:
>> Hi Mike and Julian,
>>
>> Sorry for the delayed response; been working on/thinking about other
>> mergey stuff lately...
>>
>> A few inline comments and then a proposed course of action at the end.
>>
>> On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>>> The question then becomes, "How do users deal with legitimate partial dumps
>>> that will be loaded atop something other than loads from previous
>>> incremental windows?"  I think they do this the same way they handle the
>>> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
>>> maybe 'svndumpfilter' grows the logic and options required to pare away
>>> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
>>> stream range")?
>>
>> I vote for svndumpfilter growing the logic to filter ranges predating
>> the start of the dump stream in the same way it strips mergeinfo with
>> merge sources filtered from the dump stream.  Basically this means
>> moving the r927243 functionality to svndumpfilter.  No really sure we
>> need a new option, seems we could overload
>> --skip-missing-merge-sources to include this behavior.  Thoughts?
>
> +1
>
>>> Julian Foad wrote:
>>>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>>>
>>>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>>>> mergeinfo from a merge source that predates the beginning of the dump
>>>>> window, but it's only a warning and the dump continues.
>>
>> Why only --incremental mode?  We should warn for both --incremental
>> dumps *and* non-incremental dumps because in the latter case we might
>> be specifying a dump range -rX:Y where rX>0* and there might be
>> mergeinfo that refers to revs < X.  And while we might want to be
>> clever about skipping the warning when no range is specified (i.e.
>> defaulting to -r0:HEAD), even then we should warn because issue #3020
>> might already have resulted in a r1 with mergeinfo in the starting
>> repository.
>
> You're right, of course.  I was thinking "partial dump" and saying
> "incremental".  My bad.
>
>>>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>>>> fed is a sensible one.
>>>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>>>> warning if it refers to a non-existent source.  The admin then gets to
>>>> figure out whether it was bad in the first place or because of his/her
>>>> partial-dump/partial-load scenario.
>>>>
>>>> However, it depends how efficient the checking is.  If that would make
>>>> the 'load' really slow, I can see that not being wanted.
>
> [...]
>
>> No unsurprisingly, this took significantly (62%) longer, 00:05:39.609.
>
> (My vote was for 'svnadmin load' does nothing smart, and I stand by it.)
>
>> Taking what you've had to say into consideration, I propose the following:
>>
>> 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
>> This stops svnadmin load from automatically filtering mergeinfo
>> referring to revisions outside of the load stream.
>>
>> 2) Reapply or reimplement the part of r327243 that accounts for the
>> case where the first loaded revision in a load stream has mergeinfo,
>> see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
>> item #1.
>
> (r927243, you mean.)
>
>> 3) Keep the inline copy-source warnings as they exist today in
>> svnadmin dump, but add a new end-of-process summary warning e.g.:
>>
>>   * Dumped revision 504.
>>   * Dumped revision 505.
>>   * Dumped revision 506.
>>   * Dumped revision 507.
>>   WARNING: Referencing data in revision 250, which is older than the oldest
>>   WARNING: dumped revision (504).  Loading this dump into an empty repository
>>   WARNING: will fail.
>>   * Dumped revision 58.
>>   * Dumped revision 59.
>>   * Dumped revision 510.
>>   * Dumped revision 511.
>>   .
>>   .
>>   .
>>   WARNING: The range of revisions dumped contained references to copy
>> sources outside that range.
>>
>> 4) Implement inline and summary functionality analogous to 3) but for
>> mergeinfo ranges outside of the dump stream, e.g.:
>>
>>   * Dumped revision 504.
>>   * Dumped revision 505.
>>   * Dumped revision 506.
>>   * Dumped revision 507.
>>   WARNING: Mergeinfo referencing revision(s) prior to revision 250,
>> which is older
>>   WARNING: than the oldest dumped revision (504).  Loading this dump may result
>>   WARNING: in invalid mergeinfo.
>>   * Dumped revision 58.
>>   * Dumped revision 59.
>>   * Dumped revision 510.
>>   * Dumped revision 511.
>>   .
>>   .
>>   .
>>   WARNING: The range of revisions dumped contained references to
>> mergeinfo outside that range.
>>
>> Also, as mentioned earlier, I think this warning should kick in
>> regardless of whether --incremental is used.
>>
>> 5) Overload svndumpfilter's --skip-missing-merge-sources option to
>> also remove mergeinfo that predates the starting revision of the dump
>> stream.
>
> +1 all over this stuff.
>
>> 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
>> in the load steam or the target repos:
>>
>>    A) It doesn't.
>>
>>    B) Only checks that the revs are valid (fast, but
>>       can allow mergeinfo that describes invalid
>>       mergesource:rev pairs)
>>
>>    C) Find an efficient way to test for valid
>>       mergesource:rev pairs
>
> For now, A (or maybe B).  Then we have our Paul back.  :-)
>

Been down the rabbit hole on-and-off the last couple weeks with this.
Did everything we discussed so far in this thread, plus a slew of new
bugs, see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc15
and onward.

At any rate, after my recent work we are left with one common(?) use
case still broken:

Loading a sequence of incremental dumps when the fist load in the
sequence is into a *non-empty* repostitory.

This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
from incremental dump', specifically see '# PART 4: Load a a series of
incremental dumps to an non-empty repository."

Basically the problem is this:

1) Given repository ReposA with X revisions in it.

   Note: Previously this use case was also broken if ReposB
   was empty to start, but this now works (assuming none of
   the incremental dumps in step 3 were svndumpfiltered in
   such a way as to remove/renumber revisions).

2) Given repository ReposB with Y revisions in it.

3) Incrementally (but fully) dump ReposA

   svnadmin dump ReposA -r0:100                 >  A.0.100.dump
   svnadmin dump ReposA -r101:200 --incremental >  A.101.200.dump
   svnadmin dump ReposA -r201:300 --incremental >  A.201.300.dump
   .
   .
   .
   svnadmin dump ReposA -rW:X     --incremental >  A.W.X.dump

4) Load the ReposA incremental dumps to ReposB

   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
   .
   .
   .
   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump

The problem arises when one of the incremental loads contains
references to mergeinfo source revisions that predate the start of the
dump.  For example, say A.201.300.dump contains mergeinfo referencing
r82.  When this is loaded into ReposB the reference should be changed
to r(82+Y) to reflect the fact that ReposB has Y revisions in it
before we ever started loading parts of ReposA.  Currently no change
is made and the source rev stays at r82, which is obviously incorrect.

The attached patch fixes this problem.  It takes all mergeinfo which
predates the first revision in the dump stream and adjusts it by the
difference between the head rev in ReposB and the current dump stream
revision.  Yes, this is a fairly fragile solution (as is our copyfrom
sources mapping, which this is based on).  If unrelated commits are
made to ReposB between loads of the incremental dumps, then the
mergeinfo source revs will be wrong; but it is *always* wrong right
now, at least this would fix this use case.  Opinions appreciated.

And yes, there are still other potential problems: Think incremental
dumps that are dumpfiltered such that some revs are dropped or
renumbered.  But I'm not sure we really want to solve *every* problem
in this space, I'm not even sure we can (without bringing the
performance of svnadmin load to its knees).  I think this patch brings
us to "good enough" with the caveat that ill-advised use of svnadmin
dump/load and/or svndumpfilter can result in incorrect mergeinfo.

Paul

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.

On Wed, Aug 4, 2010 at 4:11 AM, Julian Foad <ju...@wandisco.com> wrote:
> On Tue, 2010-08-03 at 14:35 -0400, Paul Burba wrote:
>> Sweeping through my TODO list turned up this thread.
>>
>> I'll commit this last patch tomorrow if there are no objections.
>
> No objections to this improvement from me.
>
> - Julian

On Wed, Aug 4, 2010 at 2:03 PM, C. Michael Pilato <cm...@collab.net> wrote:
> On 08/03/2010 02:35 PM, Paul Burba wrote:
>> Sweeping through my TODO list turned up this thread.
>>
>> I'll commit this last patch tomorrow if there are no objections.
>
> None here.

Thanks for taking a minute gents.  Committed
http://svn.apache.org/viewvc?view=revision&revision=982673

Paul

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Julian Foad <ju...@wandisco.com>.

On Tue, 2010-08-03 at 14:35 -0400, Paul Burba wrote:
> Sweeping through my TODO list turned up this thread.
> 
> I'll commit this last patch tomorrow if there are no objections.

No objections to this improvement from me.

- Julian


> I pause a day only because I recall Mike saying he was planning to
> take a look at this.
> 
> Paul
> 
> On Tue, May 11, 2010 at 4:00 PM, Paul Burba <pt...@gmail.com> wrote:
[...]
> > At any rate, after my recent work we are left with one common(?) use
> > case still broken:
> >
> > Loading a sequence of incremental dumps when the fist load in the
> > sequence is into a *non-empty* repostitory.
> >
> > This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
> > from incremental dump', specifically see '# PART 4: Load a a series of
> > incremental dumps to an non-empty repository."
> >
> > Basically the problem is this:
> >
> > 1) Given repository ReposA with X revisions in it.
> >
> >   Note: Previously this use case was also broken if ReposB
> >   was empty to start, but this now works (assuming none of
> >   the incremental dumps in step 3 were svndumpfiltered in
> >   such a way as to remove/renumber revisions).
> >
> > 2) Given repository ReposB with Y revisions in it.
> >
> > 3) Incrementally (but fully) dump ReposA
> >
> >   svnadmin dump ReposA -r0:100                 >  A.0.100.dump
> >   svnadmin dump ReposA -r101:200 --incremental >  A.101.200.dump
> >   svnadmin dump ReposA -r201:300 --incremental >  A.201.300.dump
> >   .
> >   .
> >   .
> >   svnadmin dump ReposA -rW:X     --incremental >  A.W.X.dump
> >
> > 4) Load the ReposA incremental dumps to ReposB
> >
> >   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
> >   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
> >   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
> >   .
> >   .
> >   .
> >   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump
> >
> > The problem arises when one of the incremental loads contains
> > references to mergeinfo source revisions that predate the start of the
> > dump.  For example, say A.201.300.dump contains mergeinfo referencing
> > r82.  When this is loaded into ReposB the reference should be changed
> > to r(82+Y) to reflect the fact that ReposB has Y revisions in it
> > before we ever started loading parts of ReposA.  Currently no change
> > is made and the source rev stays at r82, which is obviously incorrect.
> >
> > The attached patch fixes this problem.  It takes all mergeinfo which
> > predates the first revision in the dump stream and adjusts it by the
> > difference between the head rev in ReposB and the current dump stream
> > revision.  Yes, this is a fairly fragile solution (as is our copyfrom
> > sources mapping, which this is based on).  If unrelated commits are
> > made to ReposB between loads of the incremental dumps, then the
> > mergeinfo source revs will be wrong; but it is *always* wrong right
> > now, at least this would fix this use case.  Opinions appreciated.
> >
> > And yes, there are still other potential problems: Think incremental
> > dumps that are dumpfiltered such that some revs are dropped or
> > renumbered.  But I'm not sure we really want to solve *every* problem
> > in this space, I'm not even sure we can (without bringing the
> > performance of svnadmin load to its knees).  I think this patch brings
> > us to "good enough" with the caveat that ill-advised use of svnadmin
> > dump/load and/or svndumpfilter can result in incorrect mergeinfo.
> >
> > Paul
> >

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 08/03/2010 02:35 PM, Paul Burba wrote:
> Sweeping through my TODO list turned up this thread.
> 
> I'll commit this last patch tomorrow if there are no objections.

None here.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: [RFC] Incremental dumps and mergeinfo (Was: Vetoing latest issue #3020 fix in 1.6.10)

Posted by Paul Burba <pt...@gmail.com>.

Sweeping through my TODO list turned up this thread.

I'll commit this last patch tomorrow if there are no objections.

I pause a day only because I recall Mike saying he was planning to
take a look at this.

Paul

On Tue, May 11, 2010 at 4:00 PM, Paul Burba <pt...@gmail.com> wrote:
> On Tue, Apr 20, 2010 at 11:08 PM, C. Michael Pilato <cm...@collab.net> wrote:
>> Paul Burba wrote:
>>> Hi Mike and Julian,
>>>
>>> Sorry for the delayed response; been working on/thinking about other
>>> mergey stuff lately...
>>>
>>> A few inline comments and then a proposed course of action at the end.
>>>
>>> On Thu, Apr 15, 2010 at 9:28 AM, C. Michael Pilato <cm...@collab.net> wrote:
>>>> The question then becomes, "How do users deal with legitimate partial dumps
>>>> that will be loaded atop something other than loads from previous
>>>> incremental windows?"  I think they do this the same way they handle the
>>>> copy case:  either hand-hacking the dumpstream or using 'svndumpfilter'.  So
>>>> maybe 'svndumpfilter' grows the logic and options required to pare away
>>>> mergeinfo that doesn't meet some criteria (such as "ranges outside the dump
>>>> stream range")?
>>>
>>> I vote for svndumpfilter growing the logic to filter ranges predating
>>> the start of the dump stream in the same way it strips mergeinfo with
>>> merge sources filtered from the dump stream.  Basically this means
>>> moving the r927243 functionality to svndumpfilter.  No really sure we
>>> need a new option, seems we could overload
>>> --skip-missing-merge-sources to include this behavior.  Thoughts?
>>
>> +1
>>
>>>> Julian Foad wrote:
>>>>> On Wed, 2010-04-14, C. Michael Pilato wrote:
>>>>>> Assuming similar behavior for mergeinfo handling, we have, at a minimum:
>>>>>>
>>>>>> 1.  'svnadmin dump' warns when it is in incremental mode and must generate
>>>>>> mergeinfo from a merge source that predates the beginning of the dump
>>>>>> window, but it's only a warning and the dump continues.
>>>
>>> Why only --incremental mode?  We should warn for both --incremental
>>> dumps *and* non-incremental dumps because in the latter case we might
>>> be specifying a dump range -rX:Y where rX>0* and there might be
>>> mergeinfo that refers to revs < X.  And while we might want to be
>>> clever about skipping the warning when no range is specified (i.e.
>>> defaulting to -r0:HEAD), even then we should warn because issue #3020
>>> might already have resulted in a r1 with mergeinfo in the starting
>>> repository.
>>
>> You're right, of course.  I was thinking "partial dump" and saying
>> "incremental".  My bad.
>>
>>>>>> 2.  'svnadmin load' does nothing smart, trusting that the dump it's being
>>>>>> fed is a sensible one.
>>>>> Better: 'svnadmin load' tries to validate the mergeinfo, and issues a
>>>>> warning if it refers to a non-existent source.  The admin then gets to
>>>>> figure out whether it was bad in the first place or because of his/her
>>>>> partial-dump/partial-load scenario.
>>>>>
>>>>> However, it depends how efficient the checking is.  If that would make
>>>>> the 'load' really slow, I can see that not being wanted.
>>
>> [...]
>>
>>> No unsurprisingly, this took significantly (62%) longer, 00:05:39.609.
>>
>> (My vote was for 'svnadmin load' does nothing smart, and I stand by it.)
>>
>>> Taking what you've had to say into consideration, I propose the following:
>>>
>>> 1) Revert http://svn.apache.org/viewvc?view=revision&revision=927243.
>>> This stops svnadmin load from automatically filtering mergeinfo
>>> referring to revisions outside of the load stream.
>>>
>>> 2) Reapply or reimplement the part of r327243 that accounts for the
>>> case where the first loaded revision in a load stream has mergeinfo,
>>> see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc10
>>> item #1.
>>
>> (r927243, you mean.)
>>
>>> 3) Keep the inline copy-source warnings as they exist today in
>>> svnadmin dump, but add a new end-of-process summary warning e.g.:
>>>
>>>   * Dumped revision 504.
>>>   * Dumped revision 505.
>>>   * Dumped revision 506.
>>>   * Dumped revision 507.
>>>   WARNING: Referencing data in revision 250, which is older than the oldest
>>>   WARNING: dumped revision (504).  Loading this dump into an empty repository
>>>   WARNING: will fail.
>>>   * Dumped revision 58.
>>>   * Dumped revision 59.
>>>   * Dumped revision 510.
>>>   * Dumped revision 511.
>>>   .
>>>   .
>>>   .
>>>   WARNING: The range of revisions dumped contained references to copy
>>> sources outside that range.
>>>
>>> 4) Implement inline and summary functionality analogous to 3) but for
>>> mergeinfo ranges outside of the dump stream, e.g.:
>>>
>>>   * Dumped revision 504.
>>>   * Dumped revision 505.
>>>   * Dumped revision 506.
>>>   * Dumped revision 507.
>>>   WARNING: Mergeinfo referencing revision(s) prior to revision 250,
>>> which is older
>>>   WARNING: than the oldest dumped revision (504).  Loading this dump may result
>>>   WARNING: in invalid mergeinfo.
>>>   * Dumped revision 58.
>>>   * Dumped revision 59.
>>>   * Dumped revision 510.
>>>   * Dumped revision 511.
>>>   .
>>>   .
>>>   .
>>>   WARNING: The range of revisions dumped contained references to
>>> mergeinfo outside that range.
>>>
>>> Also, as mentioned earlier, I think this warning should kick in
>>> regardless of whether --incremental is used.
>>>
>>> 5) Overload svndumpfilter's --skip-missing-merge-sources option to
>>> also remove mergeinfo that predates the starting revision of the dump
>>> stream.
>>
>> +1 all over this stuff.
>>
>>> 6) Open: How does svnadmin load deal with mergeinfo that doesn't exist
>>> in the load steam or the target repos:
>>>
>>>    A) It doesn't.
>>>
>>>    B) Only checks that the revs are valid (fast, but
>>>       can allow mergeinfo that describes invalid
>>>       mergesource:rev pairs)
>>>
>>>    C) Find an efficient way to test for valid
>>>       mergesource:rev pairs
>>
>> For now, A (or maybe B).  Then we have our Paul back.  :-)
>>
>
> Been down the rabbit hole on-and-off the last couple weeks with this.
> Did everything we discussed so far in this thread, plus a slew of new
> bugs, see http://subversion.tigris.org/issues/show_bug.cgi?id=3020#desc15
> and onward.
>
> At any rate, after my recent work we are left with one common(?) use
> case still broken:
>
> Loading a sequence of incremental dumps when the fist load in the
> sequence is into a *non-empty* repostitory.
>
> This is tested in svnadmin_tests.py 20 'don't filter mergeinfo revs
> from incremental dump', specifically see '# PART 4: Load a a series of
> incremental dumps to an non-empty repository."
>
> Basically the problem is this:
>
> 1) Given repository ReposA with X revisions in it.
>
>   Note: Previously this use case was also broken if ReposB
>   was empty to start, but this now works (assuming none of
>   the incremental dumps in step 3 were svndumpfiltered in
>   such a way as to remove/renumber revisions).
>
> 2) Given repository ReposB with Y revisions in it.
>
> 3) Incrementally (but fully) dump ReposA
>
>   svnadmin dump ReposA -r0:100                 >  A.0.100.dump
>   svnadmin dump ReposA -r101:200 --incremental >  A.101.200.dump
>   svnadmin dump ReposA -r201:300 --incremental >  A.201.300.dump
>   .
>   .
>   .
>   svnadmin dump ReposA -rW:X     --incremental >  A.W.X.dump
>
> 4) Load the ReposA incremental dumps to ReposB
>
>   svnadmin load ReposB --parent-dir /some/subtree < A.0.100.dump
>   svnadmin load ReposB --parent-dir /some/subtree < A.101.200.dump
>   svnadmin load ReposB --parent-dir /some/subtree < A.201.300.dump
>   .
>   .
>   .
>   svnadmin load ReposB --parent-dir /some/subtree < A.W.X.dump
>
> The problem arises when one of the incremental loads contains
> references to mergeinfo source revisions that predate the start of the
> dump.  For example, say A.201.300.dump contains mergeinfo referencing
> r82.  When this is loaded into ReposB the reference should be changed
> to r(82+Y) to reflect the fact that ReposB has Y revisions in it
> before we ever started loading parts of ReposA.  Currently no change
> is made and the source rev stays at r82, which is obviously incorrect.
>
> The attached patch fixes this problem.  It takes all mergeinfo which
> predates the first revision in the dump stream and adjusts it by the
> difference between the head rev in ReposB and the current dump stream
> revision.  Yes, this is a fairly fragile solution (as is our copyfrom
> sources mapping, which this is based on).  If unrelated commits are
> made to ReposB between loads of the incremental dumps, then the
> mergeinfo source revs will be wrong; but it is *always* wrong right
> now, at least this would fix this use case.  Opinions appreciated.
>
> And yes, there are still other potential problems: Think incremental
> dumps that are dumpfiltered such that some revs are dropped or
> renumbered.  But I'm not sure we really want to solve *every* problem
> in this space, I'm not even sure we can (without bringing the
> performance of svnadmin load to its knees).  I think this patch brings
> us to "good enough" with the caveat that ill-advised use of svnadmin
> dump/load and/or svndumpfilter can result in incorrect mergeinfo.
>
> Paul
>