Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2014/12/19 13:23:11 UTC

Symmetry between dump and load

I believe the following symmetries should be true, and testable, and we should test them.

For any valid repository:

  * we can dump it
  * we can load the dump file into a new repository
  * the new repo is equivalent to the old repo

For any valid dump file:

  * we can load it into a new repository
  * we can dump that repository
  * the new dump file is equivalent to the old dump file
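As a rough illustration of the two properties above, here is a minimal Python sketch against a toy model. Everything in it is an assumption made for illustration: the "repository" is just a list of revisions (each a dict of path to text), the "dump" is JSON, and plain equality stands in for "equivalent". None of the names are Subversion APIs.

```python
import json

def dump(repo):
    """Toy 'dump': serialize a list of revisions (each a dict path -> text)."""
    return json.dumps(repo, sort_keys=True).encode("utf-8")

def load(dump_file):
    """Toy 'load': rebuild the revision list from a dump."""
    return json.loads(dump_file.decode("utf-8"))

def buggy_load(dump_file):
    """A deliberately asymmetric loader: it silently drops empty files."""
    return [{p: t for p, t in rev.items() if t != ""}
            for rev in load(dump_file)]

def repo_round_trip_ok(repo, dump_fn, load_fn):
    """For a repository: dump it, load the dump into a new one, compare."""
    return load_fn(dump_fn(repo)) == repo

def dump_round_trip_ok(dump_file, dump_fn, load_fn):
    """For a dump file: load it, dump the result, compare with the input."""
    return dump_fn(load_fn(dump_file)) == dump_file

# Hypothetical sample history: r1 adds a file, r2 adds an empty file.
repo = [{"trunk/a.txt": "hello"},
        {"trunk/a.txt": "hello", "trunk/empty": ""}]

print(repo_round_trip_ok(repo, dump, load))        # True
print(repo_round_trip_ok(repo, dump, buggy_load))  # False
```

In a real test the two callables would presumably wrap 'svnadmin dump' and 'svnadmin load' on temporary repositories, and the comparison would use whatever definition of "equivalent" we settle on.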


WHY?

This thought was triggered after noticing that we keep finding more and more asymmetries (that is, bugs) in dump and load. Most of the ones I have paid attention to are related to mergeinfo. Examples:

  #3912 svnadmin load does fail to process dumps with non UTF-8 path names
  #4414 dump/load with invalid mergeinfo
  #4476 Mergeinfo containing r0 makes svnsync and dump and load fail
  #4492 svnrdump load assertion failure if Node-path starts with a slash
  #4538 'load' strips r1 references in mergeinfo
  #4539 Need a way to 'load' a dump without munging mergeinfo
  #4573 mergeinfo parsing inconsistency: empty path

Why does this matter? Users care about stability. Waiting for a bug to show up, fixing it, and adding a regression test for that particular case gets us only so far. We could be pro-active, and go looking for these sorts of bugs much more aggressively. I think we should.

Why should we declare that these symmetries hold? Because we defined dump and load to be the canonical (or "lowest common denominator") back-up mechanism: its whole purpose is to represent the content of a repository unambiguously and completely and transfer that content to a different repository. (Oops, it fails in the "completely" department: it doesn't represent locks, for one thing.) And because we rely on these symmetries in our understanding and maintenance of the software.

Why should these symmetries be so tight that they can be mechanically tested, without an unmanageable number of intentional differences? Because we can't produce solid software if we can't test it!


HOW?

The meanings of "valid" and "equivalent" will need to be defined carefully. Here are some starting points for definitions.

"valid repository":
  The result of any combination of:

  * calling any libsvn_repos or higher level APIs, even with bad parameters and including calls that fail;
  * calling APIs below libsvn_repos, in appropriate ways, with appropriate parameters and taking appropriate action if calls fail;
  * starting with a "valid repository" produced by an older released version of Subversion, even if we consider that version to be buggy.

"valid dump file":
  Any file that can be loaded without the loader throwing an error.

"equivalent repositories":
  * when queried through libsvn_repos or higher level APIs, yield identical results; and
  * when dumped, yield identical dump files.

"equivalent dump files":
  * when loaded, yield equivalent repositories.
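To show how these definitions fit together, here is a small Python sketch over a toy model. The JSON "dump format", the load/dump/query functions and the sample inputs are all hypothetical stand-ins, not Subversion code. The point is only that two dump files can differ byte-for-byte and still be "equivalent" under the definition above.

```python
import json

def load(dump_file):
    """Toy loader: raises ValueError if the input is not a list of revisions."""
    repo = json.loads(dump_file)   # JSONDecodeError is a ValueError
    if not isinstance(repo, list):
        raise ValueError("expected a list of revisions")
    return repo

def dump(repo):
    """Toy dumper: emits one canonical serialization per repository."""
    return json.dumps(repo, sort_keys=True)

def query(repo):
    """Toy stand-in for the read APIs: sorted paths in the last revision."""
    return sorted(repo[-1]) if repo else []

def is_valid_dump(dump_file):
    """'Valid dump file': the loader accepts it without an error."""
    try:
        load(dump_file)
    except ValueError:
        return False
    return True

def repos_equivalent(a, b):
    """Same query results and identical dump files."""
    return query(a) == query(b) and dump(a) == dump(b)

def dumps_equivalent(x, y):
    """When loaded, the two dump files yield equivalent repositories."""
    return repos_equivalent(load(x), load(y))

x = '[{"a": "1", "b": "2"}]'
y = '[ {"b": "2", "a": "1"} ]'      # same content, different bytes
print(x == y, dumps_equivalent(x, y))
```

Note that the definitions refer to each other (repositories via dumps, dumps via repositories), so the real versions need a byte-level comparison somewhere to bottom out, as dump() equality does here.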


FUZZING

How can we possibly test all valid repositories and all valid dump files? Not by hand-crafted test cases, that's certain. However, the technique of repeatable, pseudo-random testing, aka "fuzzing", can enable us to approach closer and closer to complete test coverage, the more time we throw at it. Forget the idea that a test case has to have a predetermined coverage and has to run to completion every time we run "the tests". Instead, when run as part of the normal test suite, this "fuzzer" would generate a small number of test cases from pseudo-random inputs, and run them. These would be different each time it runs.

The "repeatable" part is that, whenever a generated test case fails, the parameters would be logged in a way that allows that specific case to be re-generated. Then it can be examined, re-tested against different builds, and, if it detected a real bug, inserted into the test suite as a separate, static regression test to be run every time.

The test code would also have a mode that tells it to keep generating and running pseudo-random test cases for a long or unlimited time.
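A minimal sketch of the "repeatable" idea in Python, assuming a toy repository model (a list of revisions, each a dict of path to text). The generator, the check callback and all names are hypothetical. What matters is that each case is derived entirely from one integer seed, so logging the seed is enough to regenerate a failing case.

```python
import random

def generate_case(seed, max_revs=5):
    """Build a toy repository history deterministically from `seed`."""
    rng = random.Random(seed)
    files = {}
    repo = []
    for _ in range(rng.randint(1, max_revs)):
        path = "trunk/f%d" % rng.randint(0, 3)
        if path in files and rng.random() < 0.3:
            del files[path]                      # sometimes delete a file
        else:
            files[path] = "content-%d" % rng.randint(0, 999)
        repo.append(dict(files))                 # snapshot for this revision
    return repo

def run_fuzz(check, iterations, master_seed=None):
    """Run `check` on generated cases; return the seeds of failing cases."""
    if master_seed is None:
        # Different cases on each run unless a master seed is supplied.
        master_seed = random.SystemRandom().randrange(2 ** 32)
    seeds = random.Random(master_seed)
    failures = []
    for _ in range(iterations):
        seed = seeds.randrange(2 ** 32)
        if not check(generate_case(seed)):
            failures.append(seed)                # log enough to reproduce
    return failures

# Hypothetical check that "fails" on histories of three or more revisions.
failing = run_fuzz(lambda repo: len(repo) < 3, iterations=50, master_seed=7)
print("failing seeds:", failing)
```

Each logged seed can be passed back to generate_case() to rebuild exactly that case for debugging or for a static regression test. The long-running mode would simply be a much larger (or unbounded) iteration count.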


OTHER SYMMETRIES

Subversion is quite rich in symmetries, more so than some other software because its job is to preserve data.

  * svnrdump dump and load should be symmetrical. They should also be equivalent to svnadmin dump and load respectively, except as modified by RA layer constraints.

  * svnsync should directly create an equivalent repository.

  * Any query to a write-through proxy should return the same result as querying the master.

  * Most of the Subversion library APIs have read and write interfaces which should be (broadly) symmetrical. Major ones include FSFS; FS; repos; delta; diff(+patch); RA; and to some extent WC.

  * Many low-level two-way conversions should be symmetrical: reading/writing config files, parsing/unparsing mergeinfo.

  * Getting more advanced... Any change or series of changes committed to 'trunk', we should be able to commit instead to a branch and then merge to trunk. If there were no changes (or no conflicting changes) made on trunk in the meantime, the end result should be identical.

  * 'svn diff -rX:Y' and 'svn diff -rY:X' should be mirror images.

  * and many more!
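To make the parse/unparse case concrete, here is a Python sketch using a deliberately simplified, mergeinfo-like text format ("path:1-3,5", one path per line). This is a toy written for illustration, not Subversion's mergeinfo parser, and it ignores details such as non-inheritable ranges. It shows that one direction of the round trip can hold for every input while the other holds only for canonical input.

```python
def parse(text):
    """Parse 'path:1-3,5' lines into {path: [(start, end), ...]}."""
    result = {}
    for line in text.splitlines():
        if not line:
            continue
        # Split on the last colon so that paths may contain colons.
        path, _, ranges = line.rpartition(":")
        parsed = []
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            parsed.append((int(start), int(end or start)))
        result[path] = sorted(parsed)
    return result

def unparse(info):
    """Write the canonical text form: sorted paths, sorted ranges."""
    lines = []
    for path in sorted(info):
        items = ["%d" % s if s == e else "%d-%d" % (s, e)
                 for s, e in info[path]]
        lines.append("%s:%s" % (path, ",".join(items)))
    return "\n".join(lines)

canonical = "/branches/b:5\n/trunk:1-3,7"
messy = "/trunk:7,1-3\n/branches/b:5-5"   # same meaning, different text

print(unparse(parse(canonical)) == canonical)   # True
print(unparse(parse(messy)) == messy)           # False
print(parse(unparse(parse(messy))) == parse(messy))  # True
```

Whether the second result counts as a bug or as an accepted, documented asymmetry is exactly the kind of decision such a test would force us to make.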



Thoughts?

- Julian

Re: Symmetry between dump and load

Posted by Mark Phippard <ma...@gmail.com>.
On Fri, Dec 19, 2014 at 7:23 AM, Julian Foad <ju...@btopenworld.com>
wrote:

> I believe the following symmetries should be true, and testable, and we
> should test them.
>
> For any valid repository:
>
>   * we can dump it
>   * we can load the dump file into a new repository
>   * the new repo is equivalent to the old repo
>
> For any valid dump file:
>
>   * we can load it into a new repository
>   * we can dump that repository
>   * the new dump file is equivalent to the old dump file
>
>
This is just a drive-by comment ..

Just wanted to add that Locks are completely lost by a dump/load process.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Issue #4544, Symmetry between dump and load

Posted by Julian Foad <ju...@btopenworld.com>.
This is now filed as issue #4544, "Symmetry between dump and load":
http://subversion.tigris.org/issues/show_bug.cgi?id=4544

- Julian

Re: Symmetry between dump and load

Posted by Julian Foad <ju...@btopenworld.com>.
Branko Čibej <br...@wandisco.com> wrote:
> Julian Foad wrote:
>> I believe the following symmetries should be true, and testable, and we 
>> should test them.
>> 
>>  For any valid repository:
>> 
>>    * we can dump it
>>    * we can load the dump file into a new repository
>>    * the new repo is equivalent to the old repo
>> 
>>  For any valid dump file:
>> 
>>    * we can load it into a new repository
>>    * we can dump that repository
>>    * the new dump file is equivalent to the old dump file
> 
> I agree that this should be our goal. However, consider that some of
> these symmetries depend on specific features of the repository
> implementation.
> 
> For example, at some point you mentioned dump files with non-UTF-8
> paths. Such dump files are clearly invalid, since we've maintained the
> restriction that all strings used internally must be encoded in UTF-8.
> So, such a dump file can only be the result of manual fiddling, or a bug
> in some version of some repository back-end implementation. A different
> and/or fixed backend will not accept non-UTF-8 paths at all; thus, we
> cannot maintain this particular symmetry.

Yes, exactly. By testing, we could discover this kind of problem. The solution to this kind of issue is not necessarily that we have to prioritize total symmetry across all versions and implement 'fixes'; rather, part of the goal of testing is to discover such asymmetries so that we can be aware of them, document them, and decide what further action to take if any, which may be accepting the asymmetry and adjusting the testing if necessary to account for it.

> Conversely, if we decide that maintaining strict dump/load symmetry is
> more important, we're—unnecessarily, IMO—complicating future development
> (e.g., the idea that repos path lookup should preserve but ignore
> differences in Unicode character representation).

I don't propose to maintain strict symmetry in all cases. The point is to *discover* issues, to make them visible, and then decide, for each issue, whether we should declare it a bug to be fixed or accept and document the asymmetry and adjust the tests for it.

> I'm sure there are other cases where maintaining strict symmetry will
> turn out to be too constraining. An example from your own bailiwick:
> when we store mergeinfo in a more reasonable structure than a versioned
> property, a load from an older dumpfile will most likely lose details
> of exactly how the mergeinfo was represented; even though a later dump
> may produce svn:mergeinfo values that are different but semantically
> equivalent to the original.

Yes, sure, that's an entirely reasonable course.

> Clearly, dump/load symmetry can be preserved even in the cases I
> mentioned, at the cost of maintaining more complex metadata (and related
> code) in the repository back-end. The question we have to answer is:
> what's the point, as long as semantics are not affected?

The point is not that strict symmetry is the number 1 priority, but rather that we use a general goal of symmetry to help us find and address problems.

- Julian


Re: Symmetry between dump and load

Posted by Branko Čibej <br...@wandisco.com>.
On 19.12.2014 13:23, Julian Foad wrote:
> I believe the following symmetries should be true, and testable, and we should test them.
>
> For any valid repository:
>
>   * we can dump it
>   * we can load the dump file into a new repository
>   * the new repo is equivalent to the old repo
>
> For any valid dump file:
>
>   * we can load it into a new repository
>   * we can dump that repository
>   * the new dump file is equivalent to the old dump file


I agree that this should be our goal. However, consider that some of
these symmetries depend on specific features of the repository
implementation.

For example, at some point you mentioned dump files with non-UTF-8
paths. Such dump files are clearly invalid, since we've maintained the
restriction that all strings used internally must be encoded in UTF-8.
So, such a dump file can only be the result of manual fiddling, or a bug
in some version of some repository back-end implementation. A different
and/or fixed backend will not accept non-UTF-8 paths at all; thus, we
cannot maintain this particular symmetry.

Conversely, if we decide that maintaining strict dump/load symmetry is
more important, we're—unnecessarily, IMO—complicating future development
(e.g., the idea that repos path lookup should preserve but ignore
differences in Unicode character representation).

I'm sure there are other cases where maintaining strict symmetry will
turn out to be too constraining. An example from your own bailiwick:
when we store mergeinfo in a more reasonable structure than a versioned
property, a load from an older dumpfile will most likely lose details
of exactly how the mergeinfo was represented; even though a later dump
may produce svn:mergeinfo values that are different but semantically
equivalent to the original.


Clearly, dump/load symmetry can be preserved even in the cases I
mentioned, at the cost of maintaining more complex metadata (and related
code) in the repository back-end. The question we have to answer is:
what's the point, as long as semantics are not affected?

-- Brane