Posted to dev@subversion.apache.org by li...@m8y.org on 2019/03/29 15:45:43 UTC

svn obliterate - more feasible these days?

Hi. As a bit of an introduction...  I was hanging out on irc://irc.freenode.net/svn since we use svn at work due to its simplicity, flexibility with regards to tag/branch structuring, easy narrow/shallow, centralisation, quite a lot of supporting tooling out there..

However for the FOSS project (Hedgewars) that I contribute to we use mercurial, and I'm also a fan.  Stuff I like:
   * clear commandline with some overlap with svn (unlike git's considerably messier syntax).
   * revsets as an incredibly powerful query language (a #svn dev mentioned some interest in this)
   * grep --all for regex-based searching of past revisions (svn would presumably need to do that serverside, if there was any interest in grepping changesets - I think someone in #svn was also interested in this at some point)
   * fast-annotate --deleted for blame annotation with deleted lines.  (https://m8y.org/tmp/blame1.xhtml - example)
   * quite a lot of flexibility in date query format (this ties into the revsets, and I worked around it a bit for my basic svn log needs with https://m8y.org/tmp/svndatelog.txt )

That said, it's still a DVCS, so it has clear problems with narrow/shallow (despite recent work on that), it still mangles history, and it is more complex for beginners.  It is also not ideal for the corporate monorepo where you track everything related to the project.  Still love my svn.

Now, when I run into an hg feature that I particularly find useful, I ask #svn if there's an equivalent for use at work.
In this case it was hg censor which allows easy removal of a change from the history.
Intended for fixing issues with licensing or accidentally uploaded keys or passwords.

I asked what svn had along those lines and the following conversation ensued:
https://m8y.org/chats/svn_obliterate.xhtml
^^^ Link to IRC chatlog of mostly danielsh explaining complexities with amending history in svn ^^^

danielsh suggested I move this to the list. So I did.
Is he right in that it might be easier to implement with this new filesystem? Are there still gotchas due to svn's design?

Thanks!
nemo on #svn

Re: svn obliterate - more feasible these days?

Posted by Stefan Fuhrmann <st...@apache.org>.
On 01.04.19 15:13, lists@m8y.org wrote:
> On Fri, 29 Mar 2019, Johan Corveleyn wrote:
>
>> On Fri, Mar 29, 2019 at 7:25 PM <li...@m8y.org> wrote:
>> ...
>>> Now, when I run into an hg feature that I particularly find useful, 
>>> I ask #svn if there's an equivalent for use at work.
>>> In this case it was hg censor which allows easy removal of a change 
>>> from the history.
>>> Intended for fixing issues with licensing or accidentally uploaded 
>>> keys or passwords.
>>>
>>> I asked what svn had along those lines and following conversation 
>>> ensued:
>>> https://m8y.org/chats/svn_obliterate.xhtml
>>> ^^^ Link to IRC chatlog of mostly danielsh explaining complexities 
>>> with amending history in svn ^^^
>>>
>>> danielsh suggested I move this to the list. So I did.
>>> Is he right in that it might be easier to implement with this new 
>>> filesystem? Are there still gotchas due to svn's design?
>>
>> Hi Nemo, thanks for bringing this to the list :-).
>>
>> I can't comment on the feasibility of implementing this in the
>> filesystem, and whether FSFS f7 (with logical addressing enabled -- it
>> is the default for f7, but is optional) makes it easier than f6.
>> Perhaps Stefan Fuhrman, who wrote most of the FSFS7 code, can share
>> some insight ...
>>
>> However, at the Aachen 2017 hackathon we ended up discussing
>> obliterate a bit [1] ("What hackathon is complete without a discussion
>> of obliterate?"). We focused on another hairy part of the problem:
>> what (should) happen(s) with existing working copies? How should
>> clients handle the rewritten history?
>>
>> Some options:
>>  (-1) Client doesn't notice the history change. Existing working
>> copies may or may not break at any time, in unpredictable ways.
>>
>>  (0) Client detects the changed history, and errors out with: "your
>> working copy has become unusable, check out a new one". This is
>> already possible today, by changing the UUID of the repository.
>>
>>  (1) Only working copies which are affected by the history change get
>> invalidated.
>>
>>  (2) Working copies which are affected automatically adjust / rebase
>> / remove the obliterated content.
>>
>> See also this thread from last year [2] where some ideas were bounced
>> around (including a bit about "what should clients do with existing
>> working copies?" [3])
>
TL;DR: Obliterate has become quite feasible with fsfs7 but may not be
the best solution for your use-case.
> Eh, I don't feel it's a hijack, I'm curious if it's technically 
> feasible, but it's good to know people are actually thinking about 
> implementation issues.
> FWIW, I tried mercurial DVCS' censor and it worked pretty much as I 
> expected.
> That is, there's no attempt to alter the history of remote clones 
> (good IMO).
>
> So, if you cloned prior to the censor, you get the unmodified copy.  
> Further updates do not change this.
> If you clone after the censor you get the modified copy.
The equivalent problem in SVN manifests in any working copy containing
the obliterated data. In fact, if it did at any point in the past, the
pristine store may still contain it. There is no enforced "update + cleanup".
>
> I don't know how well this maps to SVN's centralised approach, but 
> treating the working copy similarly makes sense to me...
Yes.

As Johan already said, there should be a way to validate a
given working copy against the repository. Without that, you may see
things like "checksum mismatch" errors during commit etc. We don't
have that feature right now, though.
>
> Possibly related, what happens to working copies now, if I use 
> svndumpfilter or authz to hide/remove a file from the repository?
Semantically, they break as they refer to a repository that no longer
exists. In practice, it depends: If your revision(s) of your checked out
branch did not change, you should be fine. It is just that there is no
way for the user to tell whether that is the case.

Back to the actual obliterate. Subversion has been designed to never
modify history. That makes it foolproof as you can always revert to
any previous revision (which may also be a legal requirement in some
cases). So, if you don't want certain files in your project, just delete
them and commit; you will never lose any data.

That already covers a use-case that is annoying to handle in VCS
which replicate the whole repo / history on the client. Furthermore,
if the data is legal to be in the repository (e.g. nothing that your
company has no right to), you might as well keep it. Set up authz
on it to hide it from non-authorized users, if you need to be sure
that sensitive data will be protected anywhere in the history. Authz
should be fast enough these days.

Beyond that point, obliterate becomes an option. Here is how you
might implement it on FSFSv7:

(1) Identify the node-revision(s) that contain data to be removed.
     Note: removing revisions themselves is a lot more work with
     little added benefit.
(2) Those noderevs point to the actual representations to obliterate.
(3) Scan the repository for all noderevs and representations (delta!)
     that point to representations to obliterate. Thanks to the index
     data in FSFSv7, this is basically a linear read at full disk
     throughput.
(4) Update dependent data.
(5) Remove obliterated representations.
(6) Bump instance ID and tell users to checkout anew
     (maybe, change the repo URL?).

What you do in (4) depends on your use-case. Representations (file,
directory and property content) are usually stored as deltas against
some previous representation. One option is to replace all nodes along
the delta tree with empty representations. Another option is to store
their contents in full instead of as a delta against obliterated data.

FSFSv7 makes it easy to change the size of a piece of data without
breaking any pointers: The index contains the mapping.

As for removing representations (5), I would suggest replacing
them with empty ones. If you need to remove any trace of them,
then you must also scan for directories in (3) and must update
their representations, too, if they reference obliterated data. That
would be somewhat slower but can still be done with a single scan.
You need to track more info on the fly, though.
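
Steps (3) to (5) can be illustrated with a toy model. This is only a
sketch under loud assumptions, not FSFS code: the Rep class, the
"keep a prefix of the base, then append" delta format, and the function
names expand, depends_on and obliterate are all hypothetical and exist
only for illustration. It shows the "replace with empty, rewrite
dependents with full content" variant.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Rep:
    """Toy representation: full text, or a delta against `base` (hypothetical)."""
    suffix: bytes                # bytes appended after the kept prefix
    base: Optional[str] = None   # id of the delta base; None means full text
    keep: int = 0                # number of leading bytes of the base to keep


def expand(reps: Dict[str, Rep], rep_id: str) -> bytes:
    """Reconstruct the full text by walking the delta chain."""
    rep = reps[rep_id]
    if rep.base is None:
        return rep.suffix
    return expand(reps, rep.base)[:rep.keep] + rep.suffix


def depends_on(reps: Dict[str, Rep], rep_id: str, targets: Set[str]) -> bool:
    """True if the delta chain of rep_id passes through any target."""
    base = reps[rep_id].base
    while base is not None:
        if base in targets:
            return True
        base = reps[base].base
    return False


def obliterate(reps: Dict[str, Rep], targets: Set[str]) -> Set[str]:
    """Empty the targets and rewrite their dependents as full text."""
    # (3) Scan all representations for ones deltified against a target.
    dependents = {rid for rid in reps
                  if rid not in targets and depends_on(reps, rid, targets)}
    # (4) Expand dependents while the delta bases are still intact,
    #     then store them in full so they no longer need the targets.
    fulltexts = {rid: expand(reps, rid) for rid in dependents}
    for rid, text in fulltexts.items():
        reps[rid] = Rep(suffix=text)
    # (5) Replace the obliterated representations with empty ones.
    for rid in targets:
        reps[rid] = Rep(suffix=b"")
    return dependents


if __name__ == "__main__":
    reps = {
        "r1": Rep(suffix=b"user=nemo\n"),
        "r2": Rep(suffix=b"password=hunter2\n", base="r1", keep=10),
        "r3": Rep(suffix=b"password=<unset>\n", base="r2", keep=10),
    }
    rewritten = obliterate(reps, {"r2"})
    print(sorted(rewritten), expand(reps, "r2"), expand(reps, "r3"))
```

The dependents are expanded before anything is emptied; doing it in the
other order would reconstruct them against an empty base and silently
corrupt their contents.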

I hope that this short sketch gives a good idea of what needs to
be done on the server side. A simple version of this (replace with
empty, rewrite dependents with full content) should be doable with
a couple of hundred LOCs. Right now, nobody is actually working
on this, though. If you have the spare cycles, take a look at the
fsfs-stats code for how to scan v7 repos and give it a go.

-- Stefan^2.

Re: svn obliterate - more feasible these days?

Posted by li...@m8y.org.
On Fri, 29 Mar 2019, Johan Corveleyn wrote:

> On Fri, Mar 29, 2019 at 7:25 PM <li...@m8y.org> wrote:
> ...
>> Now, when I run into an hg feature that I particularly find useful, I ask #svn if there's an equivalent for use at work.
>> In this case it was hg censor which allows easy removal of a change from the history.
>> Intended for fixing issues with licensing or accidentally uploaded keys or passwords.
>>
>> I asked what svn had along those lines and following conversation ensued:
>> https://m8y.org/chats/svn_obliterate.xhtml
>> ^^^ Link to IRC chatlog of mostly danielsh explaining complexities with amending history in svn ^^^
>>
>> danielsh suggested I move this to the list. So I did.
>> Is he right in that it might be easier to implement with this new filesystem? Are there still gotchas due to svn's design?
>
> Hi Nemo, thanks for bringing this to the list :-).
>
> I can't comment on the feasibility of implementing this in the
> filesystem, and whether FSFS f7 (with logical addressing enabled -- it
> is the default for f7, but is optional) makes it easier than f6.
> Perhaps Stefan Fuhrman, who wrote most of the FSFS7 code, can share
> some insight ...
>
> However, at the Aachen 2017 hackathon we ended up discussing
> obliterate a bit [1] ("What hackathon is complete without a discussion
> of obliterate?"). We focused on another hairy part of the problem:
> what (should) happen(s) with existing working copies? How should
> clients handle the rewritten history?
>
> Some options:
>  (-1) Client doesn't notice the history change. Existing working
> copies may or may not break at any time, in unpredictable ways.
>
>  (0) Client detects the changed history, and errors out with: "your
> working copy has become unusable, check out a new one". This is
> already possible today, by changing the UUID of the repository.
>
>  (1) Only working copies which are affected by the history change get
> invalidated.
>
>  (2) Working copies which are affected automatically adjust / rebase
> / remove the obliterated content.
>
> See also this thread from last year [2] where some ideas were bounced
> around (including a bit about "what should clients do with existing
> working copies?" [3])

Eh, I don't feel it's a hijack, I'm curious if it's technically feasible, but 
it's good to know people are actually thinking about implementation issues.
FWIW, I tried mercurial DVCS' censor and it worked pretty much as I expected.
That is, there's no attempt to alter the history of remote clones (good IMO).

So, if you cloned prior to the censor, you get the unmodified copy.  Further updates do not change this.
If you clone after the censor you get the modified copy.

I don't know how well this maps to SVN's centralised approach, but treating the working copy similarly makes sense to me...

Possibly related, what happens to working copies now, if I use svndumpfilter or authz to hide/remove a file from the repository?

>
> Anyway, I didn't want to hijack this thread, feel free to ignore this
> and focus on the filesystem-rewrite issue.
>
> [1] https://cwiki.apache.org/confluence/display/SVN/Aachen2017#Aachen2017-Obliterate
> [2] https://svn.haxx.se/dev/archive-2018-03/0155.shtml (Script to
> obliterate the most recent revision(s))
> [3] https://svn.haxx.se/dev/archive-2018-03/0171.shtml

-- 
----------------------------------------
Free Mickey!
http://randomfoo.net/oscon/2002/lessig/
http://www.law.duke.edu/cspd/comics/zoomcomic.html
My key: https://m8y.org/keys.html

Re: svn obliterate - more feasible these days?

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Mar 29, 2019 at 7:25 PM <li...@m8y.org> wrote:
...
> Now, when I run into an hg feature that I particularly find useful, I ask #svn if there's an equivalent for use at work.
> In this case it was hg censor which allows easy removal of a change from the history.
> Intended for fixing issues with licensing or accidentally uploaded keys or passwords.
>
> I asked what svn had along those lines and following conversation ensued:
> https://m8y.org/chats/svn_obliterate.xhtml
> ^^^ Link to IRC chatlog of mostly danielsh explaining complexities with amending history in svn ^^^
>
> danielsh suggested I move this to the list. So I did.
> Is he right in that it might be easier to implement with this new filesystem? Are there still gotchas due to svn's design?

Hi Nemo, thanks for bringing this to the list :-).

I can't comment on the feasibility of implementing this in the
filesystem, and whether FSFS f7 (with logical addressing enabled -- it
is the default for f7, but is optional) makes it easier than f6.
Perhaps Stefan Fuhrmann, who wrote most of the FSFS7 code, can share
some insight ...

However, at the Aachen 2017 hackathon we ended up discussing
obliterate a bit [1] ("What hackathon is complete without a discussion
of obliterate?"). We focused on another hairy part of the problem:
what (should) happen(s) with existing working copies? How should
clients handle the rewritten history?

Some options:
  (-1) Client doesn't notice the history change. Existing working
copies may or may not break at any time, in unpredictable ways.

  (0) Client detects the changed history, and errors out with: "your
working copy has become unusable, check out a new one". This is
already possible today, by changing the UUID of the repository.

  (1) Only working copies which are affected by the history change get
invalidated.

  (2) Working copies which are affected automatically adjust / rebase
/ remove the obliterated content.
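
As a rough illustration of option (0): the repository UUID can be reset
with "svnadmin setuuid REPOS_PATH [NEW_UUID]", after which existing
working copies no longer match the repository. The sketch below only
wraps that command; the helper names setuuid_command and
invalidate_working_copies and the example path are hypothetical, and it
assumes svnadmin is on the PATH of the server.

```python
import subprocess
from typing import List, Optional


def setuuid_command(repos_path: str, new_uuid: Optional[str] = None) -> List[str]:
    """Build the svnadmin invocation; without new_uuid, svnadmin generates one."""
    cmd = ["svnadmin", "setuuid", repos_path]
    if new_uuid is not None:
        cmd.append(new_uuid)
    return cmd


def invalidate_working_copies(repos_path: str) -> None:
    """Give the repository a fresh UUID (hypothetical helper, run on the server)."""
    subprocess.run(setuuid_command(repos_path), check=True)


if __name__ == "__main__":
    # Example path is an assumption; print the command instead of running it.
    print(" ".join(setuuid_command("/srv/svn/myrepo")))
```

This invalidates every working copy of the repository, not just the
ones affected by the history change, which is what separates option (0)
from option (1).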

See also this thread from last year [2] where some ideas were bounced
around (including a bit about "what should clients do with existing
working copies?" [3])

Anyway, I didn't want to hijack this thread, feel free to ignore this
and focus on the filesystem-rewrite issue.

[1] https://cwiki.apache.org/confluence/display/SVN/Aachen2017#Aachen2017-Obliterate
[2] https://svn.haxx.se/dev/archive-2018-03/0155.shtml (Script to
obliterate the most recent revision(s))
[3] https://svn.haxx.se/dev/archive-2018-03/0171.shtml

-- 
Johan

Re: svn obliterate - more feasible these days?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
lists@m8y.org wrote on Fri, 29 Mar 2019 18:26 +00:00:
> Hi. As a bit of an introduction...  I was hanging out on 
> irc://irc.freenode.net/svn since we use svn at work due to its 
> simplicity, flexibility with regards to tag/branch structuring, easy 
> narrow/shallow, centralisation, quite a lot of supporting tooling out 
> there..

Welcome to the list :)

> I asked what svn had along those lines and following conversation 
> ensued:
> https://m8y.org/chats/svn_obliterate.xhtml
> ^^^ Link to IRC chatlog of mostly danielsh explaining complexities with 
> amending history in svn ^^^

For the archives, that conversation starts here:
.
    https://colabti.org/irclogger/irclogger_log/svn?date=2019-03-27#l124
.
and continues on and off during the next two days.

Cheers,

Daniel