Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2009/02/04 22:59:27 UTC

Re: Comment on obliterate functional specification

Magnus,

As I saw no other response, I'll just speak up and mention that your
proposal sounds extremely sensible. I haven't followed the previous
history or proposals of the feature, though.

Of course, the test of whether your proposal really does simplify the
whole thing is in the details of what OBLITERATION SETS are needed, and
how they can be constructed, to satisfy the end-user goals.

Can I encourage you to submit a patch to the document, that incorporates
your proposal and at least makes a start on describing how it is used to
solve the goals? Or write a second proposal that we can check in beside
the existing one.

- Julian


On Tue, 2009-01-20 at 20:14 -0800, Magnus wrote:
> I have been going through the discussion of the obliterate feature, which,
> although it has tended to start and stop, has now found a home in the
> functional spec.
> (trunk/notes/obliterate/obliterate-functional-spec.txt)
> 
> The design of this behavior is not trivial, but although I believe that
> the svnsync approach suggested by Karl Fogel (and others, off-list, if I
> understand correctly) results in a much more feasible design, I do not 
> completely agree with his comments (on dev) that there are many possible 
> ways in which it could behave. More specifically, I believe that different
> obliteration use cases all need to be built around a core obliteration 
> functionality, and that there really is only one good option for 
> implementing that core in a way which does not lead into a 
> quagmire of ill-defined outcomes.
> 
> Using the language of the specification, (along with a new concept,
> that of an OBLITERATION SET) this core consists of:
> 
> 
> 1: SELECT multiple modifications. These modifications comprise the 
>    OBLITERATION SET in the form of multiple PATH@REV pairs.
> 
> 2: OBLITERATE selected modifications.
> 
> 
> Short and sweet :-)
> 
> Three observations result from this way of viewing the matter, the first 
> of which is crucial in my view; the others are "convenience observations".
> 
> A: The data of a PATH@REV that does NOT intersect with the OBLITERATION SET  
>    is UNCHANGED by an obliteration. Always. History data may change when an
>    ancestor of the PATH@REV has been obliterated, but:
>      svn co REPO\PATH@REV LOCALPATH
>    results in EXACTLY the same working copy when REPO is the 
>    post-obliterate repository as when it is the original repository.
>    
> B: There is no "obliteration of files" that is independent of the 
>    obliteration of modifications. To "obliterate a file" (or directory), 
>    one simply has to obliterate every single modification to that file. 
>    Thus, if a file needs to be completely obliterated, this can be done 
>    by specifying a PATH@REV, finding all ancestors, direct descendants 
>    (and optionally copied descendants), and including each of them 
>    in the OBLITERATION SET.
>    
> C: There is no "obliteration of revisions" that is independent of the 
>    obliteration of modifications. To "obliterate a revision", one simply 
>    has to obliterate the modifications in the OBLITERATION SET implied by
>    "/@REV".
> 
> If point A is agreed on, I believe that the functional specification could
> be simplified quite a bit, with a main section on how to implement 
> functionality consistent with A, and additional sections on how to 
> implement specific use cases through the construction of the 
> OBLITERATION SET in different ways.
> 
> I would appreciate any comments on this, and if others concur with this
> view, I might contribute a patch to the functional-spec with some edits
> reflecting this approach.
> 
> Best,
> Magnus
> 
> ps. I posted this earlier today but it seems to have disappeared. I'm terribly sorry if I am double-posting.
> 
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1040326
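
Observations B and C in the quoted proposal can be illustrated with a
minimal sketch. It assumes a toy change log (revision number mapped to
the paths modified in that revision) and ignores copies and ancestry
entirely. The names CHANGES, is_under, set_for_tree_at_rev and
set_for_file are hypothetical and are not part of Subversion.

```python
from typing import Dict, Set, Tuple

PathRev = Tuple[str, int]

# Hypothetical toy change log: revision number -> paths modified in it.
CHANGES: Dict[int, Set[str]] = {
    1: {"/trunk", "/trunk/a.txt"},
    2: {"/trunk/a.txt", "/trunk/b.txt"},
    3: {"/trunk/b.txt"},
}


def is_under(path: str, root: str) -> bool:
    """True if path equals root or lies below it in the tree."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def set_for_tree_at_rev(changes: Dict[int, Set[str]],
                        root: str, rev: int) -> Set[PathRev]:
    """Observation C: every modification at or below root in one revision.

    With root == "/" this is the set implied by "/@REV".
    """
    return {(p, rev) for p in changes.get(rev, set()) if is_under(p, root)}


def set_for_file(changes: Dict[int, Set[str]], path: str) -> Set[PathRev]:
    """Observation B, simplified: every modification ever made to path.

    Assumption: no copies, so ancestors and copied descendants are ignored.
    """
    return {(path, rev) for rev, paths in changes.items() if path in paths}
```

Both helpers only construct an OBLITERATION SET; the obliteration
itself is a separate step, which is the point of the proposal.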

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1104601

Re: Comment on obliterate functional specification

Posted by "Magnus ..." <zu...@gmail.com>.

No, I don't believe this has been archived anywhere
outside the dev list. But not to worry, I'm still on
the case. I submitted this text as a patch to the
functional specification. See: 
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1135837

And I will make sure that any insights from this discussion
get either committed or explicitly rejected before my
involvement ends.

In fact, I have two more installments ready in this little
obliteration series, but I've been hesitant to compete for
bandwidth with the effort to get 1.6 out the door. I'll
probably post them in the near future and repost the patch
I submitted as well.

Best,
Magnus

Daniel Shahaf wrote:
 >
 > Have we archived this somewhere?  On issue #516, or in the notes/
 > directory, etc.?
 >
 > Daniel
 >
 >> -------------------------------------------------
 >>
 >> DEFINITION OF THE OBLITERATION OPERATION
 >>
 >> ...
 >>
 >>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1205597

Re: Comment on obliterate functional specification

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Magnus wrote on Thu, 5 Feb 2009 at 08:30 -0800:
> Thanks for the encouragement, Julian. As a matter of fact, I
> had written up more on the definition, but had intended to hold
> off until after release 1.6, assuming that things would ease up 
> after that. However, I will send what I have prepared now,
> and would welcome any comments.
> 
> The following text would belong somewhere early in a revised
> functional specification for obliteration:
> 

Have we archived this somewhere?  On issue #516, or in the notes/ 
directory, etc.?

Daniel

> -------------------------------------------------
> 
> DEFINITION OF THE OBLITERATION OPERATION
> 
>   An OBLITERATION SET is defined by a list of PATH@REVISION elements 
>   (that is, each element is a pair, consisting of a PATH and REVISION). 
>   The same PATH can be paired with multiple REVISIONS to form 
>   multiple elements and vice versa.
> 
>     Note: The set is restricted so that if, for a given REVISION, 
>     PATH@REVISION is part of the OBLITERATION SET, any element of 
>     the form [PATH/RELATIVEPATH]@REVISION is also part of 
>     the set. (This simply means that if a directory change is 
>     obliterated in a revision, all changes to its contents must 
>     also be obliterated in the same revision).
>     [Note on the note. Perhaps this restriction can be lifted.
>     However, it seems that doing so would greatly complicate
>     both the behavior and implementation of the operation,
>     without much benefit.]
> 
>   An ORIGINAL repository is a repository to which an OBLITERATION 
>   operation could be applied, but has not (this includes any 
>   subversion repository without obliterations).
> 
>   A MODIFIED repository is a repository which is identical to the 
>   ORIGINAL but for which an OBLITERATION SET has been defined and 
>   an OBLITERATION operation has been applied.
> 
>   The OBLITERATION operation is defined by the following three properties:
> 
>     1. If a PATH@REVISION is checked out of the MODIFIED repository,  
>        and the PATH@REVISION is NOT in the OBLITERATION SET, the 
>        checkout data is identical to what would have been returned 
>        if PATH@REVISION had been checked out of the ORIGINAL.
>        
>     2. If a PATH@REVISION is checked out of the MODIFIED repository,  
>        and the PATH@REVISION IS in the OBLITERATION SET, the 
>        checkout data is identical to what would have been returned 
>        if PATH@REVPRIOR had been checked out of the ORIGINAL, where 
>        REVPRIOR is the last revision prior to REVISION for which
>        PATH@REVPRIOR is not in the OBLITERATION SET.
>        
>     3. Any other mechanism through which a user can interact with
>        the repository (diff/merge/copy/commit/etc) should work
>        consistently. That is, assume that a REFERENCE repository 
>        existed from which nothing had been obliterated, but for 
>        which any checkout operation yielded the same data as for the 
>        MODIFIED repository. Then every remote interaction with
>        MODIFIED must yield a result indistinguishable from what 
>        would happen if the same operation were applied to the 
>        REFERENCE repository.
>        
>      Note: Here, data refers to the reported existence of the path,
>      the versioned properties that apply to the path, and for files,
>      the actual contents of the file.
>      
>      Note: This definition does not state what happens to  
>      revision properties (several options are available), and it
>      does not state what happens to the reported history of 
>      the path (again, several options are available).
>      
>      Note: Implicit in the above is the fact that the core 
>      OBLITERATION functionality would not drop empty revisions. 
>      This is intentional, and dropping empty revisions should be
>      done through a separate mechanism.
>      
> -------------------------------------------------     
> 
> The above definition fulfills several desirable criteria:
>  * It is in my view parsimonious
>  * It is relatively short
>  * It has clearly defined behavioral implications
>  
> However, the make-or-break criteria are of course two:
>  * Can obliteration, as defined above, be feasibly implemented?
>  * Would such an implementation address all required use-cases?
>  
> I believe the answer to both of the above questions to be yes,
> and I would be happy to elaborate on why I believe this to 
> be the case, through discussions on the mailing list and through
> patches to the functional specification.
> 
> Best regards,
> Magnus
> 
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1108134
>
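
Properties 1 and 2 of the quoted definition can be read as a lookup
rule. The following is a minimal sketch of that rule only, assuming a
toy model in which the ORIGINAL repository is a mapping from path to
{revision: content} and in which revision 0 is always empty. The names
original_checkout and modified_checkout are hypothetical. Property 3,
directories, properties and copies are not modelled.

```python
from typing import Dict, Optional, Set, Tuple

PathRev = Tuple[str, int]
# Toy ORIGINAL repository: path -> {revision in which it changed: content}.
History = Dict[str, Dict[int, str]]


def original_checkout(history: History, path: str, rev: int) -> Optional[str]:
    """Content of path@rev in ORIGINAL, or None if the path does not exist.

    The content is the one recorded by the latest change at or before rev.
    """
    changes = history.get(path, {})
    revs = [r for r in changes if r <= rev]
    return changes[max(revs)] if revs else None


def modified_checkout(history: History, obliteration_set: Set[PathRev],
                      path: str, rev: int) -> Optional[str]:
    """Content of path@rev in MODIFIED, following properties 1 and 2."""
    r = rev
    # Walk back to REVPRIOR: the last revision whose PATH@REV is not in
    # the set. If (path, rev) itself is not in the set, r stays at rev
    # and this is exactly property 1.
    while r >= 0 and (path, r) in obliteration_set:
        r -= 1
    if r < 0:
        return None  # nothing prior survives (assumption of this sketch)
    return original_checkout(history, path, r)
```

For example, with a file changed in r1, r2 and r3 and only r2 in the
OBLITERATION SET, a checkout at r2 returns the r1 content while
checkouts at r1 and r3 are unchanged.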

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1200601

Re: Comment on obliterate functional specification

Posted by ac...@zulutime.net.

Agreed. As I mentioned, this is intended for "somewhere early in a 
revised functional specification"; it is not supposed to be the 
complete specification.

The two issues you mention do indeed need to be agreed on. 
However, I would like to note here that in my opinion:

1. Given a MODIFIED repository, property 3 of the definition
   guarantees that space can be reclaimed and data completely
   removed from the repository through a svnsync->replace cycle.
   (Since svnsync must by definition see the repository as
   it would have been had the obliterated revisions never
   been committed).

2. The sync->replace cycle approach to obliteration was 
   originally suggested by Karl Fogel (although I believe 
   he envisioned the obliteration logic as residing in the 
   sync-type program itself, rather than using svnsync as-is).

   For an utter and complete removal from the repository, I 
   believe that such a sync->replace cycle (perhaps made 
   transparent by encapsulating it in a single operation)
   is the best that can be achieved without vastly 
   complicating the operation. The reason:
     
     If the repository is traversed from revision 0 up to HEAD, it
     is close to impossible to know if a (much) later revision
     includes paths copied from the obliterated sets (in which
     case the data would be lost). Thus, the operation would first
     need to compile a list of every single copy operation, check 
     every obliteration against this list, and store the data 
     somewhere before removing it from the repository.
     
     If the repository is traversed from revision HEAD down to 0,
     it will, upon encountering an obliterated modification at 
     revision N, merge that modification with whatever happened
     at revision N+1. If it then encounters a modification of the
     same path at revision N-1, it will have to go back to
     N+1 to merge the N-1 mod with the joint N and N+1 mod. This
     means that in doing the obliteration, a revision can not
     be finalized until the obliteration is over. Furthermore,
     whenever a copy is encountered, logic must be applied to 
     figure out where it should originate AFTER obliterations 
     are performed, and it must be rewritten. (This might require it to
     be rewritten in a way that is not consistent with the 
     earlier revision as it occurs BEFORE modification.)
     
     Thus to sum up: Any "true" obliteration mechanism
     that does not have access to the complete data of the 
     original repository during the whole obliteration operation
     will become hopelessly nonlinear in any scenario, and close
     to impossible to implement on a live repository. My functional
     specification will therefore not require such an approach.

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1110077

Re: Comment on obliterate functional specification

Posted by Philipp Marek <ph...@emerion.com>.

Hello David,

On Freitag, 6. Februar 2009, David Glasser wrote:
> That's a good write-up, but it doesn't handle the other big design
> decisions for obliterate: whether it's acceptable for the data to be
> reconstructible by somebody with direct access to the repository, and
> whether it's acceptable for space to not be reclaimed after
> obliterate.
>
> (For FSFS in particular, the answer to these questions hugely
> constrains implementation alternatives, since node IDs include the
> offset in a rev file.)
Of course, just changing a few pointers so that the obliterated data becomes 
unreachable is fine as a fast operation that can be done while running 
normally. (Apart from destroying some users' working copies, if they currently 
have some to-be-obliterated data checked out, and where the next update would 
[probably] result in wrong data.)

But I think there should be at least some way (eg. by using "svnadmin pack") 
to reclaim the space.
Wiping the data (with a single pass of zeroes) might work for some people, 
too, but as there's no easy way to punch holes in files (ie. make them sparse 
in the middle) the space would be lost.

About the node IDs: How about some kind of "svnadmin pack -rX", that keeps all 
offsets intact (to avoid having to change *a lot* of revisions), but skips 
blocks where possible to make the file sparse? Sounds like an easy way for me.

Regards,

Phil

Re: Comment on obliterate functional specification

Posted by David Glasser <gl...@davidglasser.net>.

That's a good write-up, but it doesn't handle the other big design
decisions for obliterate: whether it's acceptable for the data to be
reconstructible by somebody with direct access to the repository, and
whether it's acceptable for space to not be reclaimed after
obliterate.

(For FSFS in particular, the answer to these questions hugely
constrains implementation alternatives, since node IDs include the
offset in a rev file.)

--dave

On Thu, Feb 5, 2009 at 8:30 AM, Magnus <ac...@zulutime.net> wrote:
> Thanks for the encouragement, Julian. As a matter of fact, I
> had written up more on the definition, but had intended to hold
> off until after release 1.6, assuming that things would ease up
> after that. However, I will send what I have prepared now,
> and would welcome any comments.
>
> The following text would belong somewhere early in a revised
> functional specification for obliteration:
>
> -------------------------------------------------
>
> DEFINITION OF THE OBLITERATION OPERATION
>
>  An OBLITERATION SET is defined by a list of PATH@REVISION elements
>  (that is, each element is a pair, consisting of a PATH and REVISION).
>  The same PATH can be paired with multiple REVISIONS to form
>  multiple elements and vice versa.
>
>    Note: The set is restricted so that if, for a given REVISION,
>    PATH@REVISION is part of the OBLITERATION SET, any element of
>    the form [PATH/RELATIVEPATH]@REVISION is also part of
>    the set. (This simply means that if a directory change is
>    obliterated in a revision, all changes to its contents must
>    also be obliterated in the same revision).
>    [Note on the note. Perhaps this restriction can be lifted.
>    However, it seems that doing so would greatly complicate
>    both the behavior and implementation of the operation,
>    without much benefit.]
>
>  An ORIGINAL repository is a repository to which an OBLITERATION
>  operation could be applied, but has not (this includes any
>  subversion repository without obliterations).
>
>  A MODIFIED repository is a repository which is identical to the
>  ORIGINAL but for which an OBLITERATION SET has been defined and
>  an OBLITERATION operation has been applied.
>
>  The OBLITERATION operation is defined by the following three properties:
>
>    1. If a PATH@REVISION is checked out of the MODIFIED repository,
>       and the PATH@REVISION is NOT in the OBLITERATION SET, the
>       checkout data is identical to what would have been returned
>       if PATH@REVISION had been checked out of the ORIGINAL.
>
>    2. If a PATH@REVISION is checked out of the MODIFIED repository,
>       and the PATH@REVISION IS in the OBLITERATION SET, the
>       checkout data is identical to what would have been returned
>       if PATH@REVPRIOR had been checked out of the ORIGINAL, where
>       REVPRIOR is the last revision prior to REVISION for which
>       PATH@REVPRIOR is not in the OBLITERATION SET.
>
>    3. Any other mechanism through which a user can interact with
>       the repository (diff/merge/copy/commit/etc) should work
>       consistently. That is, assume that a REFERENCE repository
>       existed from which nothing had been obliterated, but for
>       which any checkout operation yielded the same data as for the
>       MODIFIED repository. Then every remote interaction with
>       MODIFIED must yield a result indistinguishable from what
>       would happen if the same operation were applied to the
>       REFERENCE repository.
>
>     Note: Here, data refers to the reported existence of the path,
>     the versioned properties that apply to the path, and for files,
>     the actual contents of the file.
>
>     Note: This definition does not state what happens to
>     revision properties (several options are available), and it
>     does not state what happens to the reported history of
>     the path (again, several options are available).
>
>     Note: Implicit in the above is the fact that the core
>     OBLITERATION functionality would not drop empty revisions.
>     This is intentional, and dropping empty revisions should be
>     done through a separate mechanism.
>
> -------------------------------------------------
>
> The above definition fulfills several desirable criteria:
>  * It is in my view parsimonious
>  * It is relatively short
>  * It has clearly defined behavioral implications
>
> However, the make-or-break criteria are of course two:
>  * Can obliteration, as defined above, be feasibly implemented?
>  * Would such an implementation address all required use-cases?
>
> I believe the answer to both of the above questions to be yes,
> and I would be happy to elaborate on why I believe this to
> be the case, through discussions on the mailing list and through
> patches to the functional specification.
>
> Best regards,
> Magnus
>
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1108134
>



-- 
glasser@davidglasser.net | langtonlabs.org | flickr.com/photos/glasser/

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1109763

Re: Comment on obliterate functional specification

Posted by Magnus Torfason <zu...@gmail.com>.

Branko Cibej wrote:
 > Consider your example of a "bad" comment in the code -- you do want to
 > find all the versions of the file in HEAD (all branches and tags, too)
 > that contain the offending text, an automated relatedness search will
 > help there. But then you have to fix all those variants (perhaps by
 > applying the same patch to all) and likely *not* obliterate the fixed
 > versions, only the ones from your original list of relatives. An
 > automated obliterate-by-bloodline would happily kill off your latest
 > fixed versions in HEAD too. :)

I agree with your general analysis, as well as your comments in
following email on the great fuzziness of allowing the system to
retroactively edit a file's contents throughout its history. On that
note, I would like to point out that the functional specification
already contains the following text (from before I started messing
around with it):

   "The lowest level of modification we should consider is the change
   to a file or directory committed in a specific revision.
   (Read: no need to support obliterating a single line in a document)"

And I think we absolutely should not allow the "modifying" history
(as contrasted with "erasing" history) use-case to enter into the
specification. (Read: no retroactive applying of patches to
non-head revisions)

 > (Which raises another interesting question: what happens to object
 > relatedness if you obliterate key links in the revision tree?)

Yes, this is very interesting and important.

If the obliteration does not affect the existence of a source
path@rev, my view is that a copy from path@rev should continue to
originate from the same path@rev. The delta needs to change, but as
subversion already allows copy+modify to occur in a single commit,
this does not seem like a problem to me.

If the existence of path@rev changes with obliteration (i.e. path@rev
disappears), then the simplest thing is to just let the copy get
converted to an add. This is what svnsync does currently in all cases,
as a former post of mine demonstrates:
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1234159

That behavior could be improved in my opinion, in the following manner:
If the copy from path@rev becomes obsolete, but a copy from an earlier
path@rev in the object's history is possible (typically because it has
been copied before), then the copy should be rewritten to come from
the latest previous path@rev that exists in the repo.

If the copy source disappears in the obliteration, and all its prior
history is being obliterated as well, I think there is no real option
other than to convert the copy to a plain add. I've toyed around with
ideas where the copy direction would get switched so that when there
was originally an A->B copy, but then the initial part of A's history
gets obliterated so that B comes into existence first, the addition of
A would be recorded as a B->A copy. But that is just too ugly to
consider seriously (IMHO).
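
The three cases above can be summarised in a minimal sketch. It
assumes the caller already knows which path@rev pairs still exist
after obliteration and can list the earlier path@rev pairs in the copy
source's history, newest first. The function name rewrite_copy and its
parameters are hypothetical; this is the behaviour proposed in this
message, not what svnsync does today.

```python
from typing import Callable, List, Optional, Tuple

PathRev = Tuple[str, int]


def rewrite_copy(source: PathRev,
                 earlier_history: List[PathRev],
                 exists_after: Callable[[PathRev], bool]
                 ) -> Tuple[str, Optional[PathRev]]:
    """Decide how a copy should be recorded after obliteration.

    source          -- the original copy-from path@rev
    earlier_history -- earlier path@rev pairs of the same object,
                       ordered from most recent to oldest
    exists_after    -- True if a path@rev still exists post-obliteration
    """
    # Case 1: the source survives, so the copy keeps its origin
    # (only the delta stored with the copy would change).
    if exists_after(source):
        return ("copy", source)
    # Case 2: fall back to the latest earlier path@rev that survives.
    for candidate in earlier_history:
        if exists_after(candidate):
            return ("copy", candidate)
    # Case 3: the whole prior history is gone, so record a plain add.
    return ("add", None)
```

A caller would pass, for instance, a set membership test as
exists_after, such as `lambda pr: pr in surviving`.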

I do like this discussion, I feel that a lot of ambiguity about the
obliteration functionality is getting cleared up here. I realize that
we are still in the middle of a big release, but I do hope
that a level of agreement can be reached, which can then be codified
into the functional specification (I volunteer to do the
codification), and (if we are lucky) into implementation notes.

Best,
Magnus

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1263069

Re: Comment on obliterate functional specification

Posted by Branko Cibej <br...@xbc.nu>.

Magnus Torfason wrote:
> However, even in the disk-space story, data *is* destroyed
> (imagine a file that was deleted because it was not useful at the
> time, and "hey, it's all in subversion, so I can just keep my
> directory clean without having to worry"). Someone naively running
> "svn archive" and then wanting to restore an old file might be in
> for a nasty surprise.
>   

I'd expect "svn archive" to do exactly that -- split old stuff out of
the repo, but archive it in a way that keeps it marginally accessible,
so that ancient archived data can be reconstructed.

-- Brane

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1278537

Re: Comment on obliterate functional specification

Posted by Magnus Torfason <zu...@gmail.com>.

On 3/4/2009 6:04 PM, Jack Repenning wrote:

 > [In the disc-space story,] we want to remove the space no longer
 > in use for any path/revs that should remain available post-
 > obliteration, but the space that makes up some ancient delta
 > which is still in use, post-obliteration, we should not remove.
 > That is: a post-obliterate checkout of path@HEAD should show the
 > same result as it did before obliteration, even if the post-
 > obliteration checkout includes some text which was introduced
 > into the repository during some now-removed revision.

I agree 100%.

 > [...] the "security" story, [...] wants to remove *information* even
 > from current versions.
 >
 > ...
 >
 > - "security" wants to remove information, requires absolute removal
 > throughout all revisions, and is willing to sacrifice working copy
 > continuity.

This is very true. I'll admit that the "security" story is a bit
further from my day-to-day reality than the "disk-space" story.
However, I've been working on a writeup of a use-case that I
envision, along with the work flow to resolve it using what's
in the functional specification.

I hope to post that in the relatively near future.

 > It's remarkably hard for me to think of these two things as the
 > same operation! I would call the "disc space story" something
 > else, "archive," because as a practical matter all our customers
 > keep asking us for this function, and they always call it
 > "archive." I would leave the name "obliterate" for the "security
 > story," because though relatively few of our customers ever
 > mention this, when it comes up, that's the sort of term they
 > use for it.

I get where you're coming from. I think the idea is that since the
two both involve changing old revs in the repo, they belong together
in implementation, even if that would not rule out differing
user interfaces.

However, even in the disk-space story, data *is* destroyed
(imagine a file that was deleted because it was not useful at the
time, and "hey, it's all in subversion, so I can just keep my
directory clean without having to worry"). Someone naively running
"svn archive" and then wanting to restore an old file might be in
for a nasty surprise.

Best,
Magnus

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1273982

Re: Comment on obliterate functional specification

Posted by Jack Repenning <jr...@collab.net>.

On Mar 3, 2009, at 8:36 AM, Magnus Torfason wrote:

> I have to think about svn blame. Are you saying that "svn blame"
> should continue to return the same output as before the obliteration?

Sorry, no, not particularly. I was using "svn blame" as a short hand  
for "infinite knowledge about the ancestry of every byte in every path/ 
rev," and specifically using it to describe our knowledge of the repo  
just *before* the obliteration. Guess I over-simplified my wording a  
mite. To restate the paragraph without mentioning "blame,"

[In the disc-space story,] we want to remove the space no longer in  
use for any path/revs that should remain available post-obliteration,  
but the space that makes up some ancient delta which is still in use,  
post-obliteration, we should not remove. That is: a post-obliterate  
checkout of path@HEAD should show the same result as it did before  
obliteration, even if the post-obliteration checkout includes some  
text which was introduced into the repository during some now-removed  
revision.

That is, I was drawing the distinction between the "security" story,  
which wants to remove *information* even from current versions, and  
the "space" story, which wants no change in current and near-current  
(that is, post-obliteration) checkouts.

Or, to point up the difference in another way:

- "space" wants to save disc space, requires no change in recent  
revisions (and working copy continuity), and is willing to sacrifice  
invariance of checkouts of older revisions.

- "security" wants to remove information, requires absolute removal  
throughout all revisions, and is willing to sacrifice working copy  
continuity.

It's remarkably hard for me to think of these two things as the same  
operation! I would call the "disc space story" something else,  
"archive," because as a practical matter all our customers keep asking  
us for this function, and they always call it "archive." I would leave  
the name "obliterate" for the "security story," because though  
relatively few of our customers ever mention this, when it comes up,  
that's the sort of term they use for it.

-==-
Jack Repenning
Chief Technology Officer
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
mobile: +1 408.835.8090
raindance: +1 877.326.2337, x844.7461
aim: jackrepenning
skype: jackrepenning
twitter: http://twitter.com/jrep

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1269073

Re: Comment on obliterate functional specification

Posted by Magnus Torfason <zu...@gmail.com>.

Hi Jack,

I would say that you both correctly spotted the problem (that
the complexity of consistently modifying the history of the repository
is magnified because of the wide variety of use-cases), and my proposed
solution (to try to factor some of the complexity out of what could
be thought of as the "core" obliterate functionality, so that it could be
"dealt with later").

The question is then, is my proposed solution feasible? Needless to
say, I think it is. See specific comments below.

On Mar 2, 2009, at 8:17 PM, Jack Repenning wrote:
 > I seem to see a problem here, or perhaps I only fail to see the
 > solution. Let me spin a user story and see where it takes us.
 >
 > Suppose we're dealing with the "security" form of the problem: some
 > information has been introduced into the repository that ought not to
 > have been, and we need to ensure that it disappears, as thoroughly as
 > possible. Suppose, further, that this sensitive information was
 > introduced in the form of comment text in a source-code file. The
 > error was introduced as a change at the/bad/path@BADREV. Changes to
 > the/bad/path have also been made in (BADREV+1) and so on. Feel free to
 > assume any ugly thing you like such as copies, post-BADREV, to other
 > paths.
 >
 > In such a situation, it's not just the/bad/path@BADREV that must be
 > expunged, but in fact all the later revisions based on it (unless,
 > indeed, we can positively determine that someone edited that text out
 > again at some later date).

Yes, absolutely. And all kinds of usability issues arise, not only
copies, but merges, too. And should we purge the copies, but leave the
merges, or vice versa? As you say, ugly.

 > So either the OBLITERATION SET includes the/bad/path@BADREV and also
 > all derived paths and revs (in which case, we need to automate finding
 > them all, 'cause depending on the peoples for this won't fly), or
 > alternatively some files@REVS not in the OBLITERATION SET need to have
 > check-outs which differ depending on whether they come from the
 > "original" or "modified" repository.
 >
 > Which did you have in mind?

The former.

And yes, my idea is to automate finding them all. It's just that I
think that "finding", or "constructing the correct obliteration set"
is going to seem much more manageable if we are absolutely clear on
what happens after the set has been defined, and don't have to worry
about that as well.
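
A minimal sketch of what "finding them all" could look like for the
security case, assuming the history is available as a flat list of
change records. The Change record and the functions is_within,
copied_location and descendants are hypothetical names used for
illustration, not part of Subversion. Merges, deletions, and the case
where someone later edits the text out again are deliberately ignored.

```python
from typing import NamedTuple, Optional


class Change(NamedTuple):
    """One changed path in one revision (hypothetical record, not an svn API)."""
    rev: int
    path: str
    copyfrom_path: Optional[str] = None
    copyfrom_rev: Optional[int] = None


def is_within(path, root):
    """True if path is root itself or lies below it."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def copied_location(root, src, dst):
    """Where a tainted root ends up when src is copied to dst, else None."""
    if is_within(root, src):   # the tainted root is src or lies below it
        return dst + root[len(src.rstrip("/")):]
    if is_within(src, root):   # src lies below a tainted directory
        return dst
    return None


def descendants(changes, bad_path, bad_rev):
    """Collect (path, rev) pairs for bad_path@bad_rev plus later changes to
    it and to copies of it. Merges, deletions and clean-ups are ignored."""
    tainted = {bad_path: bad_rev}   # tainted root -> first tainted revision
    result = set()
    for ch in sorted(changes, key=lambda c: c.rev):
        if ch.copyfrom_path is not None:
            # A copy taken from a tainted source taints its destination.
            for root, since in list(tainted.items()):
                dest = copied_location(root, ch.copyfrom_path, ch.path)
                if dest is not None and ch.copyfrom_rev >= since:
                    tainted.setdefault(dest, ch.rev)
                    result.add((dest, ch.rev))
        for root, since in tainted.items():
            if is_within(ch.path, root) and ch.rev >= since:
                result.add((ch.path, ch.rev))
    return result
```

The output is nothing more than a set of PATH@REV pairs, so the search
logic stays entirely separate from whatever the core operation does
with the set afterwards.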

Writing code that messes with the repository data while leaving it in
a well defined and consistent state is a challenging task as it is,
even if the functionality is 100% defined.

 > But conversely, if we're dealing with the disc-space form of the
 > problem, then we exactly do not want these later paths@REVS affected.

Exactly.

 > We want to remove the space no longer in use, but the space that makes
 > up some ancient delta which is still in use we should not remove, but
 > rather keep. A checkout of path@HEAD should show the same result,
 > including lines that "svn blame" would show us were added at r1, even
 > though we've removed (what we can of) revs 1-10000.

I absolutely agree that (core) obliterating ^/@1:10000 should have *no*
effect on the bytes returned by a checkout of HEAD, in a repository
that was up to revision 10001 before obliteration.

I have to think about svn blame. Are you saying that "svn blame"
should continue to return the same output as before the obliteration?
That does not seem right to me. I would say that after the above
obliteration the repository would look like it had 10000 empty
commits, and one huge commit in the end. Everything would look as if
the author of the last commit had added everything. After all, blame
is just a function of the revision in which a line was added to
the repository and of the revision properties.
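
To make the blame argument concrete, here is a toy model, assuming a
file is represented as a list of lines per revision and that an
obliterated revision simply shows the content of the last
non-obliterated revision before it. naive_blame is a hypothetical
helper that matches lines by text only (duplicate lines collapse); it
is not how "svn blame" is actually implemented.

```python
def naive_blame(contents_by_rev, target_rev):
    """Map each line at target_rev to the oldest revision from which the
    line has been continuously present (matched by text only)."""
    blame = {}
    for line in contents_by_rev[target_rev]:
        rev = target_rev
        while rev - 1 in contents_by_rev and line in contents_by_rev[rev - 1]:
            rev -= 1
        blame[line] = rev
    return blame


# ORIGINAL: one line added in each of r1, r2 and r3.
original = {1: ["a"], 2: ["a", "b"], 3: ["a", "b", "c"]}

# MODIFIED: r1 and r2 obliterated, so they look like the empty r0.
modified = {1: [], 2: [], 3: ["a", "b", "c"]}

before = naive_blame(original, 3)
after = naive_blame(modified, 3)
```

In this model the checkout of r3 is byte-for-byte the same in both
repositories, while every line in the modified one is attributed to r3.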

 > So it seems like one form of obliterate most definitely _does_ want
 > some sort of closure used based on the indicated problem point, while
 > the other form most definitely does _not_ want that closure applied.

Agreed, so after the first implementation of obliterate, which might
have the syntax:

svn obliterate ^/bad/path/very/bad/path@13:666

   We might add switches to the command of the form:

svn obliterate --include-descendants ^/path@100
svn obliterate --include-descendants --include-copies-from ^/path@100
svn obliterate --include-descendants --include-merges-from ^/path@100

   And of course, if we want to find the ancestors instead:

svn obliterate --include-ancestors ^/path@100
svn obliterate --include-ancestors --include-copies-to ^/path@100
svn obliterate --include-ancestors --include-merges-to ^/path@100

   It would also be very reasonable to interpret

svn obliterate ^/bad/path

   as a shorthand for

svn obliterate ^/bad/path@0:HEAD
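
   A minimal sketch of how such a target could be expanded, assuming
   the PATH[@REV[:REV]] syntax used in the examples above. The
   obliterate subcommand and this syntax are proposals, and
   parse_target is a hypothetical helper. Paths that themselves
   contain "@" are not handled.

```python
def parse_target(target, head):
    """Split a 'PATH[@REV[:REV]]' target into (path, first_rev, last_rev).

    head is the youngest revision number, used to resolve 'HEAD'.
    """
    def to_rev(text):
        return head if text == "HEAD" else int(text)

    path, sep, revs = target.rpartition("@")
    if not sep:                       # no '@': the whole history of the path
        return target, 0, head
    first, _, last = revs.partition(":")
    return path, to_rev(first), to_rev(last or first)


whole = parse_target("^/bad/path", head=700)
explicit = parse_target("^/bad/path@0:HEAD", head=700)
```

   With this reading, the bare path and the explicit @0:HEAD form
   resolve to the same revision range.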

   But the list does not stop here. What about the following use-case,
   which may seem silly, but is actually quite reasonable in some
   work flows:

svn obliterate --find-me-all-psd-files-older-than-three-months-
     that-have-modifications-occurring-less-than-one-week-apart-
     and-obliterate-the-next-to-last-commit-in-the-series-
     then-repeat-until-there-is-at-least-one-week-
     between-deltas ^/my/really/big/photoshop/projects

   (Of course, the above syntax is silly in any case).

   And as Brane noted, obliterating key links in the revision tree
   may be undesirable (even if the result is well-defined), so
   we might imagine:

svn obliterate --exclude-copies-from ^/old/and/big

   And so on ...

I think all of these use-cases, and more, can be implemented on
top of an "obliteration-set" driven core functionality. Some of
them can eventually (or immediately) find their way into the
utility that subversion users see, others will only be available
in perl scripts operating on log files (but note that all of them
could be implemented through "svn log", "perl" and
"core obliteration".)
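
As a rough illustration of the "svn log" plus script route, the sketch
below uses Python in place of perl and assumes the verbose XML log
format produced by "svn log --xml -v". select_changes is a hypothetical
helper; its result is a list of PATH@REV pairs that could be handed to
a core obliteration step.

```python
import xml.etree.ElementTree as ET


def select_changes(log_xml, predicate):
    """Return sorted (path, rev) pairs from `svn log --xml -v` output
    for which predicate(path, rev) is true."""
    selected = set()
    for entry in ET.fromstring(log_xml).iter("logentry"):
        rev = int(entry.get("revision"))
        for node in entry.iter("path"):
            if predicate(node.text, rev):
                selected.add((node.text, rev))
    return sorted(selected)


# Hand-written sample in the shape of `svn log --xml -v` output.
SAMPLE_LOG = """<log>
<logentry revision="2"><paths>
<path kind="file" action="A">/art/logo.psd</path>
<path kind="file" action="M">/src/main.c</path>
</paths></logentry>
<logentry revision="3"><paths>
<path kind="file" action="M">/art/logo.psd</path>
</paths></logentry>
</log>"""

psd_changes = select_changes(SAMPLE_LOG, lambda path, rev: path.endswith(".psd"))
```

More elaborate policies, such as thinning out closely spaced commits,
would only change the predicate or add a post-processing step over the
selected pairs.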

Furthermore, if agreement is reached, these use-cases will find
their way into obliterate-functional-spec.txt as "add-on"
features, of different priority.

Best,
Magnus

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1269031

Re: Comment on obliterate functional specification

Posted by Jack Repenning <jr...@collab.net>.

On Mar 2, 2009, at 9:09 PM, Branko Čibej wrote:

> Hyrum K. Wright wrote:
>>
>> On Mar 2, 2009, at 10:05 PM, Branko Cibej wrote:
>>
>>> Hyrum K. Wright wrote:
>>
>> You're asking a version control system to remove data, for goodness
>> sakes.  That's just dangerous and if you don't have adult supervision,
>> you get what you ask for.
>
> :) Well, I find letting your version control system collapse a file's
> history a lot less scary than letting said system edit the file's
> content throughout its history. The one is a well-defined operation,
> the other is fuzzy at best and gets fuzzier along the line -- not to
> mention that you can't avoid breaking all working copies in existence.

Considering just how heretical this sort of removal always seems to VC  
folks, I'm actually on Brane's side: better to publish a list of  
proposed changes, than to run off and do it. I've worn several hats,  
and straddled several fences, in this area for years, but to  
personify: if "I," the person with the compelling security problem and  
the not-quite-but-almost-as-compelling need to keep getting my real  
work done, come to "you," the gloriously and normally commendably  
compulsive data preserving VC person, and ask, in full recognition of  
the VC heresy, yet none the less in absolute earnest, that you expunge  
a bit of history ... well, then, I'm frankly inclined to want to check  
over what you do, because deep in my heart I know that deep in your  
heart you're only doing this under protest, and that you don't have  
that visceral understanding of the problem necessary to make proper  
edge-case calls.

-==-
Jack Repenning
Chief Technology Officer
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
mobile: +1 408.835.8090
raindance: +1 877.326.2337, x844.7461
aim: jackrepenning
skype: jackrepenning
twitter: http://twitter.com/jrep

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1262267

Re: Comment on obliterate functional specification

Posted by Magnus <ac...@zulutime.net>.

Thanks for the encouragement, Julian. As a matter of fact, I
had written up more on the definition, but had intended to hold
off until after release 1.6, assuming that things would ease up 
after that. However, I will send what I have prepared now,
and would welcome any comments.

The following text would belong somewhere early in a revised
functional specification for obliteration:

-------------------------------------------------

DEFINITION OF THE OBLITERATION OPERATION

  An OBLITERATION SET is defined by a list of PATH@REVISION elements 
  (that is, each element is a pair, consisting of a PATH and REVISION). 
  The same PATH can be paired with multiple REVISIONS to form 
  multiple elements and vice versa.

    Note: The set is restricted so that if, for a given REVISION, 
    PATH@REVISION is part of the OBLITERATION SET, any element of 
    the form [PATH/RELATIVEPATH]@REVISION is also part of 
    the set. (This simply means that if a directory change is 
    obliterated in a revision, all changes to its contents must 
    also be obliterated in the same revision).
    [Note on the note. Perhaps this restriction can be lifted.
    However, it seems that doing so would greatly complicate
    both the behavior and implementation of the operation,
    without much benefit.]

  An ORIGINAL repository is a repository to which an OBLITERATION 
  operation could be applied, but has not (this includes any 
  subversion repository without obliterations).

  A MODIFIED repository is a repository which is identical to the 
  ORIGINAL but for which an OBLITERATION SET has been defined and 
  an OBLITERATION operation has been applied.

  The OBLITERATION operation is defined by the following three properties:

    1. If a PATH@REVISION is checked out of the MODIFIED repository,  
       and the PATH@REVISION is NOT in the OBLITERATION SET, the 
       checkout data is identical to what would have been returned 
       if PATH@REVISION had been checked out of the ORIGINAL.
       
    2. If a PATH@REVISION is checked out of the MODIFIED repository,  
       and the PATH@REVISION IS in the OBLITERATION SET, the 
       checkout data is identical to what would have been returned 
       if PATH@REVPRIOR had been checked out of the ORIGINAL, where 
       REVPRIOR is the last revision prior to REVISION for which
       PATH@REVPRIOR is not in the OBLITERATION SET.
       
    3. Any other mechanism through which a user can interact with
       the repository (diff/merge/copy/commit/etc) should work
       consistently. That is, assume that a REFERENCE repository 
       existed from which nothing had been obliterated, but for 
       which any checkout operation yielded the same data as for the 
       MODIFIED repository. Then every remote interaction with
       MODIFIED must yield a result indistinguishable from what 
       would happen if the same operation were applied to the 
       REFERENCE repository.
       
     Note: Here, data refers to the reported existence of the path,
     the versioned properties that apply to the path, and for files,
     the actual contents of the file.
     
     Note: This definition does not state what happens to  
     revision properties (several options are available), and it
     does not state what happens to the reported history of 
     the path (again, several options are available).
     
     Note: Implicit in the above is the fact that the core 
     OBLITERATION functionality would not drop empty revisions. 
     This is intentional, and dropping empty revisions should be
     done through a separate mechanism.
     
-------------------------------------------------     
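
Properties 1 and 2, together with the directory restriction from the
first note, can be sketched in a few lines. This is a toy model under
stated assumptions, not an implementation: the ORIGINAL repository is
assumed to be a dictionary from (path, revision) to checkout data for
every revision, with missing keys meaning the path does not exist. The
names checkout and close_under_children are hypothetical. The definition
does not say what happens when no REVPRIOR exists; the sketch assumes
the path is then reported as absent.

```python
def close_under_children(obliterated, known_paths):
    """Extend the set so that PATH@REV implies PATH/RELATIVEPATH@REV."""
    closed = set(obliterated)
    for path, rev in obliterated:
        prefix = path.rstrip("/") + "/"
        closed.update((p, rev) for p in known_paths if p.startswith(prefix))
    return closed


def checkout(original, obliterated, path, rev):
    """Data returned for PATH@REV by the MODIFIED repository.

    original:    dict mapping (path, rev) -> data in the ORIGINAL repository
    obliterated: the OBLITERATION SET, a set of (path, rev) pairs
    """
    while (path, rev) in obliterated:
        rev -= 1                      # fall back towards REVPRIOR
    return original.get((path, rev))  # None: path absent (assumption)


original = {
    ("/f", 1): "a",
    ("/f", 2): "a+secret",
    ("/f", 3): "a+secret+b",
}

# Security case: r2 and everything built on it is obliterated.
security_set = {("/f", 2), ("/f", 3)}
# Disc-space case: old revisions are obliterated, later ones untouched.
space_set = {("/f", 1), ("/f", 2)}
```

In the security case a checkout of /f@3 falls back to the r1 content,
while in the disc-space case /f@3 is returned unchanged, which is the
behavior both use-cases ask for from the same core operation.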

The above definition fulfills several desirable criteria:
 * It is in my view parsimonious
 * It is relatively short
 * It has clearly defined behavioral implications
 
However, the make-or-break criteria are of course two:
 * Can obliteration, as defined above, be feasibly implemented?
 * Would such an implementation address all required use-cases?
 
I believe the answer to both of the above questions to be yes,
and I would be happy to elaborate on why I believe this to 
be the case, through discussions on the mailing list and through
patches to the functional specification.

Best regards,
Magnus

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=1108134