Posted to dev@subversion.apache.org by Julian Foad <ju...@apache.org> on 2022/01/11 14:26:09 UTC

Re: A two-part vision for Subversion and large binary objects.

Hello everyone. Thanks to sponsorship arranged by Karl, I'm able to work on completing this.

It's fantastic to see that Evgeny has made a working prototype and that many of you already followed with constructive suggestions.

Right now I'm reviewing this work and the long discussion about it and I will come back with a summary and plan for your consideration, and no doubt many questions, in the next few days.

-- 
- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 16 Jan 2022, Branko Čibej wrote:
>On 14.01.2022 21:29, Julian Foad wrote:
>>> multi-wc-format branch [...] anything I'm missing?
>> As soon as I stepped away I could see more clearly: Basically
>> 'multi-wc-format' is just providing an API up to the client layer for
>> enumerating WC format variants. The same logical functionality could be
>> implemented for a particular feature (e.g. pristines-on-demand) by
>> looking directly at some other on-disk representation change to
>> distinguish variants within one format number; this branch just
>> formalizes and generalizes it. It doesn't make more or better
>> compatibility than could be done ad-hoc; it doesn't magically make old
>> clients be able to work with newer formats/variants than they know about.
>
>I expect you mean, "that they *don't* know about."

I think you might have read "that" where Julian wrote "than" :-).

The pristineless-hack branch (was: Re: A two-part vision for Subversion and large binary objects.)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Daniel Shahaf wrote on Mon, Jan 24, 2022 at 09:14:32 +0000:
> Incidentally (and off-topic for this particular subthread), there's also
> a pristineless-hack branch:
> 
>     https://svn.apache.org/r1769826

The branch fails to build because temp_pristines.h was not committed.
Stefan, I don't suppose you still have a copy of that file?  It's
a 5+ year old hack that was committed directly from a trunk wc, so I'm
not holding my breath, but asking just in case.

The branch's design seems to be:

- No pristines stored (see svn_wc__db_pristine_check())

- Pass around libsvn_wc a callback, supplied by libsvn_client, that
  fetches files from libsvn_ra.  (The callback's implementation is
  fetch_rev_file(), of type rev_file_func_t.)

- Pristines installed where needed (e.g., revert_wc_data())

Presumably, pristines are vacuumed at some point (otherwise the
svn_wc__db_pristine_check() short-circuit would be wrong), but I haven't
been able to find where.
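
To make the callback arrangement concrete, here is a minimal Python
analogy, not the branch's C code.  The class and method names are
hypothetical; only the idea (the WC layer holds no pristines, receives
a fetch callback from the client layer, and installs a pristine where
an operation such as revert needs one) comes from the description above.

```python
class WorkingCopy:
    """Toy stand-in for the WC layer (hypothetical, for illustration only)."""

    def __init__(self, fetch_rev_file):
        # Callback supplied by the client layer; it is assumed to take
        # (path, revision) and return the file contents from the repository.
        self._fetch = fetch_rev_file
        self._pristines = {}  # pristines installed so far, keyed by path

    def has_pristine(self, path):
        return path in self._pristines

    def revert(self, path, revision):
        # Install the pristine only when an operation actually needs it.
        if not self.has_pristine(path):
            self._pristines[path] = self._fetch(path, revision)
        return self._pristines[path]
```

The sketch never removes an installed pristine, which mirrors the open
question above about where vacuuming would happen.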

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Branko Čibej wrote on Sun, Jan 16, 2022 at 21:37:15 +0100:
> On 14.01.2022 21:29, Julian Foad wrote:
> > So, in the context of whether it makes sense to adopt the
> > 'multi-wc-format' branch versus implementing the WC pristines code to
> > work ad-hoc with two on-disk variants, it comes down to:
> > 
> > - cost of adopting (reviewing etc.) the 'multi-wc-format' vs.
> > implementing something ad-hoc for pristines;
> > - potential future benefit from re-using 'multi-wc-format' for other changes.
> 
> My original motivation for starting multi-wc-format was to implement
> compressed pristines and in-wc-db pristines for small (for some definition
> of) files. There may be a branch for that (I don't recall)

There are two:

* compressed-pristines
  (https://svn.apache.org/repos/asf/subversion/branches/compressed-pristines/BRANCH-README?p=r1897404)

* better-pristines
  (https://svn.apache.org/repos/asf/subversion/branches/better-pristines/BRANCH-README?p=r1897404)

Incidentally (and off-topic for this particular subthread), there's also
a pristineless-hack branch:

    https://svn.apache.org/r1769826

> and some infrastructure work in spillbufs that could be reused.
> 
> > Cost of adopting 'multi-wc-format' looks likely to be lower. It is
> > mostly boiler-plate changes and simple code, and is fairly small
> > compared with the overall task:
> > 
> > 'pristines-on-demand' diff size: +2700 -1000 lines roughly
> > 'multi-wc-format'     diff size: +1000 -200 lines roughly
> > 
> > From this perspective, it looks best to adopt 'multi-wc-format' unless
> > we find some show-stopper problem with it.
> 
> That would be my recommendation, too, if only to have just one specific way
> to add feature support to the WC without requiring a WC upgrade every time.
> We've mostly tied WC features to the format version, it would seem natural
> to keep that correlation.

Re: A two-part vision for Subversion and large binary objects.

Posted by Branko Čibej <br...@apache.org>.

On 14.01.2022 21:29, Julian Foad wrote:
>> multi-wc-format branch [...] anything I'm missing?
> As soon as I stepped away I could see more clearly: Basically
> 'multi-wc-format' is just providing an API up to the client layer for
> enumerating WC format variants. The same logical functionality could be
> implemented for a particular feature (e.g. pristines-on-demand) by
> looking directly at some other on-disk representation change to
> distinguish variants within one format number; this branch just
> formalizes and generalizes it. It doesn't make more or better
> compatibility than could be done ad-hoc; it doesn't magically make old
> clients be able to work with newer formats/variants than they know about.

I expect you mean, "that they *don't* know about."

> So, in the context of whether it makes sense to adopt the
> 'multi-wc-format' branch versus implementing the WC pristines code to
> work ad-hoc with two on-disk variants, it comes down to:
>
> - cost of adopting (reviewing etc.) the 'multi-wc-format' vs.
> implementing something ad-hoc for pristines;
> - potential future benefit from re-using 'multi-wc-format' for other changes.

My original motivation for starting multi-wc-format was to implement 
compressed pristines and in-wc-db pristines for small (for some 
definition of) files. There may be a branch for that (I don't recall) 
and some infrastructure work in spillbufs that could be reused.

> Cost of adopting 'multi-wc-format' looks likely to be lower. It is
> mostly boiler-plate changes and simple code, and is fairly small
> compared with the overall task:
>
> 'pristines-on-demand' diff size: +2700 -1000 lines roughly
> 'multi-wc-format'     diff size: +1000 -200 lines roughly
>
> From this perspective, it looks best to adopt 'multi-wc-format' unless
> we find some show-stopper problem with it.

That would be my recommendation, too, if only to have just one specific 
way to add feature support to the WC without requiring a WC upgrade 
every time. We've mostly tied WC features to the format version, it 
would seem natural to keep that correlation.

> p.s. I forgot to say, in updating 'multi-wc-format' I found a few small
> problems, two of which I fixed; one remains, causing the 'upgrade' tests
> to fail, which is due to '#ifdef SVN_TEST_MULTI_WC_FORMAT' in
> 'wc-metadata.sql', which doesn't get processed as a directive and needs
> to be handled some other way.

I know about that one, yes. I essentially ran out of oomph before fixing 
it; it might be as simple as adding a small enhancement to the SQL 
statement processing scripts.

-- Brane

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

> multi-wc-format branch [...] anything I'm missing?

As soon as I stepped away I could see more clearly: Basically
'multi-wc-format' is just providing an API up to the client layer for
enumerating WC format variants. The same logical functionality could be
implemented for a particular feature (e.g. pristines-on-demand) by
looking directly at some other on-disk representation change to
distinguish variants within one format number; this branch just
formalizes and generalizes it. It doesn't make more or better
compatibility than could be done ad-hoc; it doesn't magically make old
clients be able to work with newer formats/variants than they know about.

So, in the context of whether it makes sense to adopt the
'multi-wc-format' branch versus implementing the WC pristines code to
work ad-hoc with two on-disk variants, it comes down to:

- cost of adopting (reviewing etc.) the 'multi-wc-format' vs.
implementing something ad-hoc for pristines;
- potential future benefit from re-using 'multi-wc-format' for other changes.

Cost of adopting 'multi-wc-format' looks likely to be lower. It is
mostly boiler-plate changes and simple code, and is fairly small
compared with the overall task:

'pristines-on-demand' diff size: +2700 -1000 lines roughly
'multi-wc-format'     diff size: +1000 -200 lines roughly

From this perspective, it looks best to adopt 'multi-wc-format' unless
we find some show-stopper problem with it.

p.s. I forgot to say, in updating 'multi-wc-format' I found a few small
problems, two of which I fixed; one remains, causing the 'upgrade' tests
to fail, which is due to '#ifdef SVN_TEST_MULTI_WC_FORMAT' in
'wc-metadata.sql', which doesn't get processed as a directive and needs
to be handled some other way.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 14 Jan 2022, Julian Foad wrote:
>I looked into the multi-wc-format branch today. Brane wrote previously:
>
>> This basically needs the following:
>> 
>> * a huge sync with trunk;
>
>Done.

Nice :-).  (Understatement is the height of style, ahem.)

>Now I'm considering what would be the pros and cons of using
>'multi-wc-format', compared with if it's possible to modify the
>pristines-on-demand implementation to work with no format bump but just
>an on-disk variation of the current WC format. (We haven't proved it's
>possible nor that it's impossible.)

It's worth some exploration.  If we absolutely have to have a WC 
format bump, it's not the end of the world, but it'd be great if 
we can avoid it.

>As soon as I stepped away I could see [the differences] more 
>clearly: 
>Basically 'multi-wc-format' is just providing an API up to the 
>client layer for enumerating WC format variants. The same logical 
>functionality could be implemented for a particular feature 
>(e.g. pristines-on-demand) by looking directly at some other 
>on-disk representation change to distinguish variants within one 
>format number; this branch just formalizes and generalizes it. It 
>doesn't make more or better compatibility than could be done 
>ad-hoc; it doesn't magically make old clients be able to work 
>with newer formats/variants than they know about.

Nice summary.  This makes complete sense and is what I sort of 
suspected the 'multi-wc-format' branch was doing, in fact.

>So, in the context of whether it makes sense to adopt the
>'multi-wc-format' branch versus implementing the WC pristines code to
>work ad-hoc with two on-disk variants, it comes down to:
>
>- cost of adopting (reviewing etc.) the 'multi-wc-format' vs.
>implementing something ad-hoc for pristines;
>- potential future benefit from re-using 'multi-wc-format' for other changes.
>
>Cost of adopting 'multi-wc-format' looks likely to be lower. It is
>mostly boiler-plate changes and simple code, and is fairly small
>compared with the overall task:
>
>'pristines-on-demand' diff size: +2700 -1000 lines roughly
>'multi-wc-format'     diff size: +1000 -200 lines roughly
>
>From this perspective, it looks best to adopt 'multi-wc-format' unless
>we find some show-stopper problem with it.

Agreed.

>p.s. I forgot to say, in updating 'multi-wc-format' I found a few small
>problems, two of which I fixed; one remains, causing the 'upgrade' tests
>to fail, which is due to '#ifdef SVN_TEST_MULTI_WC_FORMAT' in
>'wc-metadata.sql', which doesn't get processed as a directive and needs
>to be handled some other way.

Ah, doesn't seem like a showstopper, certainly :-).

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Fri, Jan 14, 2022 at 14:52:52 +0000:
> How would it work with 'multi-wc-format' branch included?
> =========================================================
⋮
> - old client, new-format WC (pristines-on-demand)
>   - errors out (cleanly)
>     > svn diff
>     > svn: E155021: This client is too old to work with the working copy at
> '/.../wc' (format 32). You need to get a newer [...]

That's SVN_ERR_WC_UNSUPPORTED_FORMAT.

> How could it work without a WC format bump and DB change?
> =========================================================
⋮
> - old client, WC with missing pristines
>   - errors out on some operations
>     > svn diff
>     > svn: E000002: Can't open file '/.../wc/.svn/pristine/03/03xxxx..xxxx.svn-base': No such file or directory
>   - [CHECK] Are there any scenarios that could involve data loss?

That's ENOENT.

> 
> Differences
> ===========
> 
> Essentially not much.  In both cases an old client can work with an old
> WC but would error out on a pristines-on-demand WC.  In both cases a
> newer client could work with both WCs without forcing upgrade.
> Differences in the error message don't seem significant. Is there
> anything I'm missing?

I think the difference in the error message _is_ significant.

- If the error message is seen by a human, the SVN_ERR_WC_UNSUPPORTED_FORMAT
  error message is a high-level ("porcelain") error message that
  advises what to do, while ENOENT is a low-level ("plumbing") error
  message that, because it includes an implementation detail (the full
  path to a particular .svn-base file), looks more like an invitation to
  file a bug.

- If the error is seen by a script, ditto.  A script that sees ENOENT
  has few choices other than to give up.  A script that sees
  SVN_ERR_WC_UNSUPPORTED_FORMAT can actually do something about it, such
  as run «svn upgrade» (if that makes sense in that script's use-case)
  or print an informative error message.

  FWIW, there's at least one script that looks for that particular error
  code: https://gitlab.com/zsh-org/zsh/-/blob/af0f497247150f55963e908097d04e543da55a4b/Functions/VCS_Info/Backends/VCS_INFO_get_data_svn#L24-37
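
As a rough illustration of the script case, here is a minimal Python
sketch.  The two error codes are the ones quoted in this thread; the
function name and the category strings are hypothetical, and the
parsing assumes svn prints errors as "svn: Ennnnnn: message" on stderr.

```python
import re

# Error codes quoted in this thread.
UNSUPPORTED_FORMAT = "E155021"  # SVN_ERR_WC_UNSUPPORTED_FORMAT
NO_SUCH_FILE = "E000002"        # ENOENT surfacing as an svn error


def classify_svn_error(stderr_text):
    """Return a coarse category for svn's stderr (hypothetical helper)."""
    codes = re.findall(r"\bsvn: (E\d{6}):", stderr_text)
    if UNSUPPORTED_FORMAT in codes:
        # Actionable: suggest a newer client, or run 'svn upgrade' if
        # that makes sense for the script's use-case.
        return "client-too-old"
    if NO_SUCH_FILE in codes:
        # Nothing sensible to do beyond reporting the failure.
        return "give-up"
    return "unknown" if codes else "no-error"
```

A caller would pass the captured stderr of an svn invocation; only the
first category gives it something specific to act on.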

So, +1 to doing a proper format bump.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

I looked into the multi-wc-format branch today. Brane wrote previously:

> This basically needs the following:
> 
> * a huge sync with trunk;

Done.

> * a way to pass the requested WC format from the command-line into the
> WC library when creating a working copy (I never found a nice, clean
> way to do that);

Your branch code passes the requested format down from libsvn_client to
libsvn_wc by calling svn_wc__ensure_adm(... format ...).  Maybe it's not
nice and clean but I am not sure whether anything could be better, or what.
It doesn't seem qualitatively worse than other interactions with libsvn_wc.

> * an actual test for WC compatibility.

There doesn't seem to be very much to test: the "compatibility" we're speaking
of here is just whether the client accepts the format number as one that
it can work with, and if not then it errors out with the "client too
old" error, doesn't it?  I'll see if I can think of any useful smoke test.


Now I'm considering what would be the pros and cons of using
'multi-wc-format', compared with if it's possible to modify the
pristines-on-demand implementation to work with no format bump but just
an on-disk variation of the current WC format. (We haven't proved it's
possible nor that it's impossible.)


How would it work with 'multi-wc-format' branch included?
=========================================================

The 'multi-wc-format' branch allows the client to accept a WC declared
as an older format, and not insist on upgrading it. It does not provide any
special mechanism (such as feature flags) for making the client
behaviour compatible with different versions that it claims to support;
the client must provide that.

Outcome:

- old client, old-format WC 
  - fine

- old client, new-format WC (pristines-on-demand)
  - errors out (cleanly)
    > svn diff
    > svn: E155021: This client is too old to work with the working copy at
      '/.../wc' (format 32). You need to get a newer [...]

- new client, old-format WC
  - works like the older version did
  - leaves the WC in the old format
  - optionally upgrades the WC to new format on request

- new client, new-format WC (pristines-on-demand)
  - works like pristines-on-demand
  - [OPTIONAL] a way to downgrade WC format, restoring all pristines
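
The four outcomes above reduce to a range check on the format number.
A minimal Python sketch, under assumptions: format 32 is taken from the
error message quoted above, 31 is assumed to be the old format number,
and the function name and return strings are hypothetical.

```python
OLD_FORMAT = 31  # assumed number of the current (old) WC format
NEW_FORMAT = 32  # pristines-on-demand format, per the error text above


def open_wc(wc_format, client_min, client_max):
    """Decide what a client does with a WC of the given format.

    client_min..client_max is the inclusive range of formats the
    client accepts without insisting on an upgrade.
    """
    if wc_format > client_max:
        return "error: client too old"    # the clean E155021 case
    if wc_format < client_min:
        return "error: upgrade required"  # older than anything supported
    return "ok: use as-is"                # no forced upgrade


# Old client accepts only 31; new client accepts 31 and 32.
for client in ((OLD_FORMAT, OLD_FORMAT), (OLD_FORMAT, NEW_FORMAT)):
    for wc in (OLD_FORMAT, NEW_FORMAT):
        print(client, wc, open_wc(wc, *client))
```

What the client then does differently for each accepted format is not
covered by this check; as noted above, the client must provide that.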


How could it work without a WC format bump and DB change?
=========================================================

We would make a variant of the old WC format, keeping the same format
number, and changing something else, something that makes old clients error
out as cleanly as possible.

- Example: Keep ".svn" and its files so still recognized as a WC, but rename
  e.g. "pristine" to "pristines-on-demand" so any operation involving
  pristines errors out early.  (Keeping the existing directory name and just
  omitting certain pristine files would be worse: that would lead to an
  operation succeeding on some files and then failing part way through when
  it reaches a file whose pristine is missing.)
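
For illustration, a new client could tell the two variants apart with
a directory check like the following Python sketch.  The directory name
"pristines-on-demand" is only the example name suggested above, and the
function and its return values are hypothetical.

```python
import os


def pristine_variant(wc_root):
    """Guess the pristine-store variant from the directory under .svn."""
    adm = os.path.join(wc_root, ".svn")
    if os.path.isdir(os.path.join(adm, "pristines-on-demand")):
        return "on-demand"      # renamed store: old clients fail early
    if os.path.isdir(os.path.join(adm, "pristine")):
        return "all-present"    # unchanged layout: old clients still work
    return "not-a-wc-or-unknown"
```

An old client never runs such a check; it simply fails to find
".svn/pristine" on the first operation that needs a pristine.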

Outcome:

- old client, WC with all pristines present
  - works as it always did

- old client, WC with missing pristines
  - errors out on some operations
    > svn diff
    > svn: E000002: Can't open file
      '/.../wc/.svn/pristine/03/03xxxx..xxxx.svn-base': No such file or directory
  - [CHECK] Are there any scenarios that could involve data loss?

- new client, WC with all pristines present:
  - works with WC as-is
  - doesn't change WC in any way that would affect old clients, UNTIL
  - if user chooses to start omitting pristines, then removes them

- new client, WC with missing pristines
  - works like pristines-on-demand
  - [OPTIONAL] a way to change back to all-pristines-stay-present


Differences
===========

Essentially not much.  In both cases an old client can work with an old
WC but would error out on a pristines-on-demand WC.  In both cases a
newer client could work with both WCs without forcing upgrade.
Differences in the error message don't seem significant. Is there
anything I'm missing?


- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Thu, Jan 20, 2022 at 4:03 PM Julian Foad <ju...@foad.me.uk> wrote:
>
> Karl Fogel wrote:
> >>> So if we have client-side configuration that can specify "no
> >>> pristine" based on some combination of one or more of...
> > [... size, properties, etc. ...]
> > with a general mechanism for combining conditions, then things
> > will be in a good position for future improvement.
>
> The more I think about this, the more I think we are prematurely
> complicating the requirements in this respect. I'm going to back-track
> and posit that a simple per-WC switch should suffice for the vast
> majority of cases, and has the benefit of simplicity. (The user might
> wish to set this based on the repository location -- local/fast versus remote/slow.)
>
> I will note that I previously misunderstood the current
> 'pristines-on-demand' implementation as fetching the pristine before a
> diff (for example) and discarding it afterwards.  In fact it keeps the
> pristine as long as the file in question remains in a locally modified
> state, and only discards the pristine when (before or after some client
> operation) the file is no longer in a modified state. That is to say, it
> fetches pristines less often than I had thought.

Thanks for explaining this.

If I understand correctly, when pristines are turned off:

The pristine for a given file is not fetched until the file is modified and an
operation pertaining to it requires the pristine.

Once fetched, the pristine remains present until the file becomes unmodified
through either 'commit' or 'revert'.

+1 to this. This seems more logical than immediate removal: if the file becomes
modified and its pristine is fetched, then the user may (is likely to?) run
additional operations requiring the pristine. No sense in re-fetching it
several times in rapid succession.
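
The lifecycle as I understand it can be modelled in a few lines of
Python.  This is a toy model with hypothetical names, not the branch's
code; it only counts how often a fetch would happen.

```python
class OnDemandFile:
    """Toy model of one versioned file in a pristines-on-demand WC."""

    def __init__(self):
        self.modified = False
        self.pristine_present = False
        self.fetches = 0  # how many times the pristine was fetched

    def edit(self):
        self.modified = True  # modifying the file alone fetches nothing

    def diff(self):
        # Stands for any operation that needs the pristine of a
        # modified file.
        if self.modified and not self.pristine_present:
            self.fetches += 1
            self.pristine_present = True

    def commit_or_revert(self):
        # The file becomes unmodified, so the pristine is discarded.
        # (A real revert would need the pristine first; omitted here.)
        self.modified = False
        self.pristine_present = False


f = OnDemandFile()
f.edit()
f.diff()
f.diff()          # second diff reuses the pristine already fetched
print(f.fetches)  # 1
```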

If users complain, 'svn cleanup --vacuum-pristines' could be made to just
delete all pristines when the pristines-on-demand feature is active.

While writing the above, it occurred to me that if the file is deleted, the
pristine (if present) should be deleted as well. (Potential caveat: if the file
is modified, subsequently marked for deletion with '--keep-local', and
subsequently the user runs 'svn revert', the expected result is to restore the
original contents as they appear in BASE.)

More below...

> The only case in which a simple per-WC setting might be unsatisfactory
> is the following combination:
>
>   - the repository is "slow" (and/or offline working is required);
>
>   - and, in a single WC:
>     - the WC data set is "huge" (relative to local disk space) in total; and
>     - there is a subset of files on which the user needs to work
>       (requiring diffs, etc.) often enough that fetching their pristines
>       "on demand" is a problem; and
>     - that subset of files is not "huge" in total; and
>     - that subset of files can be distinguished from the rest by metadata.
>
> That is certainly a possible case, but we have no suggestion that it is
> at all common. It is not one of the cases driving this feature. So I
> think it is not something to design for at this stage.
>
> I'm going to work on getting something more basic (per-WC yes/no) closer
> to production-ready and then we can re-assess it.

+1 to this also.

A production-ready per-wc on/off switch for pristines sounds reasonable to me
for the initial feature and is arguably better than not-production-ready and
full of bells and whistles.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Julian Foad wrote:
> * Re-base pristines-on-demand on top of multi-wc-format.

Done: branch 'pristines-on-demand-on-mwf'.

> * Make pristines-on-demand behaviour conditional on WC format.

Mostly done in r1897977.

I could do with some help on a SQL test failure (below), please.

Quoting from that log message:

  - With 'make check WC_FORMAT_VERSION=1.15': test suite still passes.
  - With 'make check [WC_FORMAT_VERSION=1.8]': some tests FAIL or XPASS:

    XPASS: authz_tests.py 31: remove a subdir with authz file
    XPASS: basic_tests.py 8: basic corruption detection on commit
           [[Relies on wc.text_base_path()]]
    XPASS: revert_tests.py 2: revert reexpands manually contracted keyword
    XPASS: trans_tests.py 1: commit new files with keywords active from birth
           [[Relies on wc.text_base_path()]]
    XPASS: trans_tests.py 3: committing eol-style change forces text send
           [[Relies on wc.text_base_path()]]
    XPASS: update_tests.py 83: missing tmp update caused segfault
           [[The error message has changed]]
    XPASS: upgrade_tests.py 16: upgrade with base and working replaced files
           [[Can't fetch pristines: the working copy points to file:///tmp/repo]]
    XPASS: upgrade_tests.py 34: automatic SQLite ANALYZE
    FAIL:  wc-queries-test 3: test query expectations

From a quick look, the XPASSes may be just a matter of '@Wimp'
annotations that had been added on one of these branches, now being out
of date, or something like that. I will check them.

I have posted separately asking for help with the FAIL in test_query_expectations().

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

>> The name of the "pristines-on-demand" branch implies a certain
>> behavior -- namely, that pristines can, via some UI, be fetched on
>> demand :-).  [...]
>
>Just to offer a counterpoint Karl, I always assumed the goal of the
>branch was to have no pristines in the WC and the "on-demand" aspect
>was referring to an internal SVN detail that it would have to fetch
>pristines when they were needed [...]

That name came, as far as I am aware, from Evgeny's branch which implements the latter.

This may be a case where the public-facing name for the feature ought to differ from the internal development name.

Any ideas for a good public name?

Pristines on Subversion's demand?
Dehydrated WC? 

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Wed, Feb 16, 2022 at 9:07 AM Mark Phippard <ma...@gmail.com> wrote:

> > FWIW, I just assumed that this *isn't* the intended entry point to
> > the feature.  That is, it's just how things happen to be on the
> > branch right now, but (presumably) Julian isn't saying that he
> > thinks this is how users should access the feature in real life.
>
> I also assume that to be the case but want to confirm.
>
> My "assumption" is that the 1.15 WC format includes some new database
> indicator(s) that specify whether or not pristines are being stored
> but the default 1.15 format would include pristines. There will be
> some other option that creates the 1.15 format but with the database
> indicator(s) set to indicate that pristines are NOT being stored.
>
> Presumably there will be some new UX as being discussed that
> implicitly creates a 1.15 format WC with these indicators set.
>
> So really the only use case for creating a 1.15 format using this more
> generic syntax is based on some future version of SVN that lets you
> selectively change this setting after a WC is created? Perhaps on a
> file/folder by file/folder basis.

Setting aside the bikeshedding on what we call this new feature ...
this is the behavior I would expect:

$ svn checkout    ==OR==
$ svn checkout --compatible-version=1.14

Creates a 1.14 compatible WC with pristines

$ svn checkout --compatible-version=1.15

Creates a 1.15 compatible WC with pristines ... there is currently no
reason for a user to do this but it leaves open the option for future
commands and options to selectively hydrate/dehydrate on a file by
file basis.

$ svn checkout --bare      ==OR==
$ svn checkout --compatible-version=1.15 --bare

Bikeshedding aside ... this creates a 1.15 compatible WC without pristines
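
Spelled out as a small Python sketch, the mapping I have in mind looks
like this.  Note that '--bare' is only a placeholder option name from
the examples above, the function name is hypothetical, and rejecting
'--bare' together with 1.14 is my assumption rather than anything
decided.

```python
def checkout_plan(compatible_version=None, bare=False):
    """Map the options above to (WC compatible version, store pristines?)."""
    if compatible_version is None:
        # '--bare' alone implies 1.15; a plain checkout stays at 1.14.
        compatible_version = "1.15" if bare else "1.14"
    if bare and compatible_version != "1.15":
        # Assumption: a 1.14 WC cannot record that pristines are absent.
        raise ValueError("--bare needs a 1.15-compatible working copy")
    return compatible_version, not bare


print(checkout_plan())                   # ('1.14', True)
print(checkout_plan("1.15"))             # ('1.15', True)
print(checkout_plan(bare=True))          # ('1.15', False)
print(checkout_plan("1.15", bare=True))  # ('1.15', False)
```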

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Wed, Feb 16, 2022 at 2:53 AM Karl Fogel <kf...@red-bean.com> wrote:

> >Are you saying this is how you would activate this no-pristines
> >feature? If so, that sounds like a poor UX. As a user, I would not
> >expect the version number to be connected to a feature like that. Or
> >more accurately, I could understand if you need a 1.15 working copy to
> >enable a WC format that tracks whether or not pristines are available
> >but I would not expect the version number alone to be the factor that
> >makes this decision. Does that make sense?
>
> FWIW, I just assumed that this *isn't* the intended entry point to
> the feature.  That is, it's just how things happen to be on the
> branch right now, but (presumably) Julian isn't saying that he
> thinks this is how users should access the feature in real life.

I also assume that to be the case but want to confirm.

My "assumption" is that the 1.15 WC format includes some new database
indicator(s) that specify whether or not pristines are being stored
but the default 1.15 format would include pristines. There will be
some other option that creates the 1.15 format but with the database
indicator(s) set to indicate that pristines are NOT being stored.

Presumably there will be some new UX as being discussed that
implicitly creates a 1.15 format WC with these indicators set.

So really the only use case for creating a 1.15 format using this more
generic syntax is based on some future version of SVN that lets you
selectively change this setting after a WC is created? Perhaps on a
file/folder by file/folder basis.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 15 Feb 2022, Mark Phippard wrote:
>On Tue, Feb 15, 2022 at 12:00 PM Julian Foad <ju...@apache.org> wrote:
>> Currently: "svn checkout --compatible-version=1.15". No feature name
>> involved. Not saying that's good, just that's the current state.
>
>Are you saying this is how you would activate this no-pristines
>feature? If so, that sounds like a poor UX. As a user, I would not
>expect the version number to be connected to a feature like that. Or
>more accurately, I could understand if you need a 1.15 working copy to
>enable a WC format that tracks whether or not pristines are available
>but I would not expect the version number alone to be the factor that
>makes this decision. Does that make sense?

FWIW, I just assumed that this *isn't* the intended entry point to 
the feature.  That is, it's just how things happen to be on the 
branch right now, but (presumably) Julian isn't saying that he 
thinks this is how users should access the feature in real life.

>I have not followed every email on this topic but feel like I have
>lost understanding of what the feature will do. I thought the original
>goal was "I have a lot of large binaries, I would like to have a WC
>with no pristines in it"

AFAIU, that's the "MVP" goal here, yup.

>Assuming I have a WC with large binaries:
>
>* I am not going to use diff
>* If I commit a change, I would like to just send the new file to the
>server and let it figure it all out
>* If I revert, yeah I will need a new copy sent to me
>* If I update, and do not have local mods, I will just get a new copy
>of file that replaces what I have
>
>* If I update, and have local mods ... less sure what should happen.
>Is this the scenario where you create a pristine? If it is not a file
>where we can do a text merge, then I guess I would just want my
>version of the file to remain and ideally not even get the new file
>sent to me. If I change my mind, I will do a revert.
>
>Personally, I think a toggle for the whole WC would be fine. And even
>if I have text files too, the handful of operations like diff where it
>downloads a pristine to do the diff might be fine as long as it works.
>Performance might even be fine.
>
>It sounds like this is not how it works?

I'm not sure that the "fetch base from server if necessary" 
behavior is part of the MVP (it might be, I'm just not sure -- 
there are decent workarounds if it's not, after all).
 
Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Tue, Feb 15, 2022 at 12:00 PM Julian Foad <ju...@apache.org> wrote:
>
> Karl Fogel wrote:
> > [...] there has to be some way for the user to specify at checkout
> > time [...]
>
> Currently: "svn checkout --compatible-version=1.15". No feature name
> involved. Not saying that's good, just that's the current state.

Are you saying this is how you would activate this no-pristines
feature? If so, that sounds like a poor UX. As a user, I would not
expect the version number to be connected to a feature like that. Or
more accurately, I could understand if you need a 1.15 working copy to
enable a WC format that tracks whether or not pristines are available
but I would not expect the version number alone to be the factor that
makes this decision. Does that make sense?

I have not followed every email on this topic but feel like I have
lost understanding of what the feature will do. I thought the original
goal was "I have a lot of large binaries, I would like to have a WC
with no pristines in it"

I thought that was also what the original PoC that Evgeny created did.
He would fetch a pristine if he needed to (such as for revert) but he
would not otherwise store it. Again ... just my understanding and
recollection.

One of your earlier responses seemed to indicate we are creating
pristines and storing them at some point. What would be the scenario
where the original user request would want that?

Assuming I have a WC with large binaries:

* I am not going to use diff
* If I commit a change, I would like to just send the new file to the
server and let it figure it all out
* If I revert, yeah I will need a new copy sent to me
* If I update, and do not have local mods, I will just get a new copy
of file that replaces what I have

* If I update, and have local mods ... less sure what should happen.
Is this the scenario where you create a pristine? If it is not a file
where we can do a text merge, then I guess I would just want my
version of the file to remain and ideally not even get the new file
sent to me. If I change my mind, I will do a revert.

Personally, I think a toggle for the whole WC would be fine. And even
if I have text files too, the handful of operations like diff where it
downloads a pristine to do the diff might be fine as long as it works.
Performance might even be fine.

It sounds like this is not how it works?

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Karl Fogel wrote:
> [...] there has to be some way for the user to specify at checkout
> time [...]

Currently: "svn checkout --compatible-version=1.15". No feature name
involved. Not saying that's good, just that's the current state.

> [...] Those are *descriptions* [...]

Yes; hoping to inspire ideas.

Nathan Hartman wrote:
> Remote BASE

That's not bad.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Thu, Feb 17, 2022 at 5:09 PM Julian Foad <ju...@apache.org> wrote:
>
> Awesome, Nathan! I was going to say this is clearly a priority. Thanks
> so much for writing that. It is so much easier to iterate on it now you
> have begun. At first glance there is not much I would add or change.

Glad to help!

> Not sure about the word "bare"; for now I'll read it as a place-holder
> for whatever term we agree on.

Agreed.

> Suggested edits below.

Thanks for the feedback!

I think it will be best if I go ahead and create the template 1.15
release notes on site/staging and add there the text we have so far
(including your suggested edits). Then we can hack on it as much as we
want. That will be easier than accumulating too many suggested
improvements in the mailing list, where it is easy to lose track of
them...

Cheers,
Nathan

Linking to the archives (was: Re: A two-part vision for Subversion and large binary objects.)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Fri, Feb 18, 2022 at 09:01:27 +0000:
> I asked about this in this thread a few weeks ago; you could see there
> for further discussion. (I tried to dig up a link but having trouble
> finding myself in the archives.)

FWIW, if you have a mail locally, you can pipe it to
<https://svn.apache.org/repos/infra/infrastructure/trunk/projects/asf-generate-mail-archives-link?p=r1078445>
to get a working archive URL for it.  (The first run will be a little
slow since it will do a network access to seed a local cache.)

And if you don't find a mail locally, you can always point to it by its
From/To/Cc/Subject/Date/Message-ID tuple.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Summary of status of #525
=========================

Currently on the 'pristines-on-demand-on-mwf' branch.

Dev tasks in progress or outstanding:
----------

* Multi-WC-format dependency (https://subversion.apache.org/issue/4883):

    - is merged to trunk and reviewed;

    - some outstanding items from review like UI tweaks, API
private-vs-public choices (in dev@ emails, and #4884 through #4887);

    - some docs probably needed (release notes, help text).

* Per-WC config (https://subversion.apache.org/issue/4889):

    - Quotes from #4889: "not strictly needed for MVP... MIGHT be good
to add the low level flagging mechanism now... not clear which would be
more effort."

    - Last thing I wrote on 2022-02-16 [1]: "I do think we need an
explicit option to enable the feature by name, not just a WC version
number. I haven't yet worked out whether it must also be possible to
upgrade to 1.15 format without enabling the feature, and thus need to
store the feature-enable flag in the WC somewhere separate from the
format version number. For future developments of other wc features,
that will be needed; I just haven't finalised yet if it's essential for
MVP. Might be, in order to not cause compatibility issues for those
future scenarios."

* Issues arising in existing regression tests (#4888 and others):

    - authz denied during textbase sync (#4888) -- in progress [2]

    - about 12 other tests that were disabled or modified -- I have
started investigating and patching; need some further attention.


Community tasks outstanding:
----------------

* initiate a merge to trunk

* decide on a name for the feature


- Julian

[1] dev@ thread "A two-part vision for Subversion and large binary objects."
[2] dev@ thread "Pristines-on-demand: authz denied during textbase sync"

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 08 Mar 2022, Daniel Shahaf wrote:
>Karl Fogel wrote on Sun, Mar 06, 2022 at 22:19:50 -0600:
>> b) The failure mode of unnecessary fetching and storing is much worse than
>> the failure mode of not having fetched a pristine that someone might turn
>> out to want (there are workarounds for the latter);
>
>What are some of those workarounds?

One can make a copy of a file before modifying it, if one thinks 
one might need to revert or do a local diff (not necessarily 
limited to regular plain-text 'diff' program, of course -- some 
binary file formats have corresponding custom diff tools).

And if one forgets to make a copy first, one can *still* fetch the 
base file manually from the server using 'svn cat', thus paying 
the network-time cost and the local-storage cost by choice.

Finally, eventually we may have an 'svn rehydrate' command (or 
'svn update --rehydrate' or whatever -- I'm not worrying about the 
UI details here, just positing that there *is* a UI).  That would 
do basically what 'svn cat' does, but in addition would integrate 
the result into the working copy as a pristine base.
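
The first workaround above (copy the file before editing it, restore it by hand later) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the '.orig' suffix and both function names are invented here, and Subversion knows nothing about the copy, so it would show up as an unversioned file.

```python
import shutil
import tempfile
from pathlib import Path


def backup_before_edit(path: Path) -> Path:
    """Copy `path` to a sibling '<name>.orig' file and return the copy's path."""
    backup = path.with_name(path.name + ".orig")  # hypothetical naming scheme
    shutil.copy2(path, backup)  # copy2 also preserves timestamps
    return backup


def manual_revert(path: Path, backup: Path) -> None:
    """Restore `path` from the user's own backup copy."""
    shutil.copy2(backup, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        asset = Path(tmp) / "texture.bin"
        asset.write_bytes(b"original bytes")
        saved = backup_before_edit(asset)
        asset.write_bytes(b"edited bytes")
        manual_revert(asset, saved)
        print(asset.read_bytes())
```

The user pays the local-storage cost by choice, one file at a time, which is the point of the workaround.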

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Sun, Mar 06, 2022 at 22:19:50 -0600:
> b) The failure mode of unnecessary fetching and storing is much worse than
> the failure mode of not having fetched a pristine that someone might turn
> out to want (there are workarounds for the latter);

What are some of those workarounds?

+1 to everything else.  API design is just a game of Simon Says :)

Thanks,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Tue, Mar 08, 2022 at 17:59:20 -0600:
> On 08 Mar 2022, Daniel Shahaf wrote:
> > Sure.  I was asking whether by "once the user has a local pristine" you
> > meant a pristine — as in, a file under .svn/pristine/ that .svn/wc.db
> > knows about and uses — or Alice making a local copy of the contents of
> > file@BASE somewhere libsvn doesn't know about.
> 
> Well, depending on the context, I may be using the word "pristine" flexibly.
> Sometimes I mean a literal integrated-into-wc-metadata pristine, and
> sometimes I just mean "an extra copy of the file, that the user has made
> locally".
> 

I see.

> (It's possible that the degree of precision you would like in this
> sub-discussion is not one I'm willing to adhere to consistently :-).  I
> can't always predict what will matter to a given interlocutor.  But I'll try
> to be sufficiently precise in my responses below at least.)

Thanks, Karl.  I hope I'm not frustrating you.  I do try to be
interoperable with as many interlocutors as possible, but using "foo" to
sometimes mean "bar" and sometimes mean "poor man's alternative to bar"
does in fact create ambiguities.

> > A manual copy of the BASE revision would "serve for local diffs and
> > reverts", indeed, but I would hesitate to recommend this, because diff
> > and revert are both core operations.  If users need to reinvent these
> > two wheels, then:
> > 
> > - All the advantages of having just that one well-known «svn revert»
> >  button that all the users' GUI clients and scripts can press are lost
> > 
> > - The local disk storage cost will be paid, but without all the
> >  benefits: e.g., commit will use a self-delta rather than a delta
> >  against BASE even if the file format does lend itself to binary diffs;
> >  ra_serf's ability to not download a file if the wc has another file
> >  with the same sha1 won't be used; the keyword-contraction and
> >  diff-ignore-content-type features of «svn diff» will need to be
> >  reimplemented; etc.
> > 
> > - We might leave a bad impression on potential users
> > 
> > As an MVP alternative, some sort of command to hydrate a single file,
> > perhaps, as you have proposed?  CLI-wise, I'll just say we might want to
> > mark such a command as experimental (name it "x-foo" and document it has
> > reduced forward compatibility promises).  Backend-wise, we'll want to
> > ensure a manually-hydrated file doesn't get dehydrated too soon.
> > 
> > What's "too soon"?  Until the user explicitly requests or permits
> > dehydration.  If hydration was manual, so should dehydration be.
> > 
> > Makes sense?
> 
> Yes, thanks for the suggestion, and I agree.  I would love for MVP or MVP+1
> to have an explicit "rehydrate" UI.  I think there *might* be some value to
> shipping MVP without such a feature, in order to first get some real-world
> experience with how people use pristine-less working copies, before we make
> long-lasting UI decisions.
> 
> But anyway, +1 to the general idea.

Filed: https://issues.apache.org/jira/browse/SVN-4894
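
The rule quoted above ("If hydration was manual, so should dehydration be") can be modelled as a small piece of bookkeeping. The sketch below is a toy model, not libsvn_wc code: the class and method names are hypothetical, and it only tracks which pristines are present and whether the user asked for them explicitly.

```python
class PristineCache:
    """Toy model of which pristines are present and why."""

    def __init__(self):
        # checksum -> True if the user hydrated it explicitly
        self._present = {}

    def fetch_for_operation(self, checksum):
        """Record a pristine fetched automatically, e.g. for a revert."""
        # setdefault keeps an existing explicit request intact
        self._present.setdefault(checksum, False)

    def hydrate(self, checksum):
        """Record an explicit user request to keep this pristine."""
        self._present[checksum] = True

    def dehydrate(self, checksum):
        """Drop a pristine on explicit user request."""
        self._present.pop(checksum, None)

    def automatic_cleanup(self):
        """Drop only the pristines that were fetched automatically."""
        self._present = {c: pinned for c, pinned in self._present.items() if pinned}

    def has(self, checksum):
        return checksum in self._present
```

With this shape, an automatic cleanup after an operation never removes something the user asked to keep; only `dehydrate()` does.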

> > The context of all this is whether 'update' should fetch pristines for
> > modified files.  I guess it should not do so by default (there's no
> > reason to incur the costs, and the user has opted in to
> > pristines-on-demand),
> > but I don't think we should tell users to keep pristines _and not tell
> > libsvn_wc about them_.  The cost of implementing «svn x-hydrate»
> > (however named) is smaller than the cost of asking users to reimplement
> > core version control functionality.
> 
> Users can already copy files behind Subversion's back, of course.
> 
> I'm worried that implementing 'svn x-hydrate' commands now would be
> premature -- we don't know enough about real-world usage yet. I'd feel more
> comfortable putting out one release (of x-hydrate-less MVP) to get feedback
> on pristine-less working copies.  We could even say that we're considering
> adding x-hydrate commands but that we're waiting until the next release so
> we can make sure our UI ideas match people's actual needs.
> 
> Anyone else have thoughts on this?
> 

Just to make sure you noticed I'm proposing this as an x-* command,
i.e., without promising it'll behave in 1.16 as it does in 1.15, or even
exist at all in 1.16?

We could write a Python script to explicitly hydrate something, even
after 1.15.0-GA, to let people experiment with that to some degree.  (It
won't preserve hydration through commits, of course.)

> > This way, by default «commit» will send self-deltas, but if the user
> > wants a pristine for diffs or reverts, then reverts, diffs, and commits
> > will all use the pristine.  There shouldn't be any need for the user to
> > reimplement their own pristine store and their own diff and revert
> > operations.
> > 
> > And yes, commit might not want to use pristines this way, but that's
> > actually a separate feature request: a request to change the "When
> > committing a change to a pristineful file, send a delta against BASE or
> > a self-delta, whichever is smaller" logic, which IIRC works by computing
> > a delta against BASE and comparing its length to the repository-normal
> > filesize, to something that doesn't compute a delta against BASE in the
> > first place.
> 
> Yes, that's a good point (in that last paragraph there), and we should take
> it into account when (re)implementing commit logic.

Filed: https://issues.apache.org/jira/browse/SVN-4895
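
For readers unfamiliar with the "whichever is smaller" logic recalled above, here is a rough Python sketch of the idea. It is not Subversion's svndiff code: zlib is swapped in as a stand-in encoder, with a preset dictionary playing the role of BASE, and all function names are invented. Note that the stand-in only benefits from the last 32 KiB of the dictionary, so it is meaningful for small inputs only.

```python
import hashlib
import zlib


def self_delta(new: bytes) -> bytes:
    """Stand-in for a self-delta: compress the new content on its own."""
    return zlib.compress(new)


def delta_against_base(base: bytes, new: bytes) -> bytes:
    """Stand-in for a delta against BASE: compress with BASE as a preset dictionary."""
    compressor = zlib.compressobj(zdict=base)
    return compressor.compress(new) + compressor.flush()


def choose_payload(base: bytes, new: bytes):
    """Compute both encodings and return (kind, payload) for the smaller one."""
    against_base = delta_against_base(base, new)
    alone = self_delta(new)
    if len(against_base) < len(alone):
        return "delta-against-base", against_base
    return "self-delta", alone


if __name__ == "__main__":
    # Deterministic, poorly compressible sample data standing in for a binary file.
    base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
    kind, payload = choose_payload(base, base + b"small edit")
    print(kind, len(payload))
```

The sketch always computes the delta against the base before comparing sizes, which is the cost the feature request above wants to avoid when no pristine is available.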

Cheers,

Daniel

Re: Initial patch for storing the pristines-on-demand setting in the working copy (Issue 4889)

Posted by Julian Foad <ju...@apache.org>.

Evgeny Kotkov wrote:
> first-cut implementation that persists the pristines-on-demand setting

This is great! Thank you Evgeny.

> The patch currently allows doing an `svn checkout --store-pristines=no`,
> which is going to create a working copy that doesn't store the pristine
> copies of the files and fetches them on demand.  The setting is persisted
> in wc.db.
> 
> 
> The patch doesn't include the following:
> 
> 1) An update for the tests and the test suite to run the tests in both modes.

Ack. An option to choose the mode is needed.

> 2) An update for `svn info` to display the value of the new setting.

Ack. That sounds simple enough.

> 3) An ability to take the --store-pristines value from a user config, perhaps
>   on a per-URL basis.

Ack. That can be simply a yes/no global default for starters. That is a
lower priority (low risk, and not blocking the rest of this).

> While working on the patch, I have stumbled across a couple of issues:
> 
> A) `svn upgrade` without arguments fails for a working copy with
> latest format

It would be nice to change it to a no-op with a friendly message. (We
have already discussed reasons why we are keeping the old format as the default.)

> B) Shelving and pristines-on-demand

Indeed shelving is not updated to work with pristines-on-demand. As an
experimental feature probably returning a simple "feature not supported"
error when pristines-on-demand is enabled would be sufficient.

> C) Bumping the related API
> 
>  This part originates from B).  For example, v3 shelving uses the libsvn_wc
>  APIs, such as svn_wc_revert6().  [...] those calls are going to fail [...]
>  [...]
>  Perhaps, we could bump the APIs that currently rely on the text-bases to
>  always be available.  And we could then make their deprecated versions
>  fail (predictably) for working copies that don't store pristine contents.

Any ideas what should the new (bumped) versions do? Just what they do
now, fail if and when pristines are missing? Fetch pristines first? Take
a parameter (or option config struct) telling whether to attempt
hydrating? Something else?

- Julian

Re: Initial patch for storing the pristines-on-demand setting in the working copy (Issue 4889)

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> While working on the patch, I have stumbled across a couple of issues:
>
> A) `svn upgrade` without arguments fails for a working copy with latest format
>
>   $ svn checkout --compatible-version=1.15 wc
>   $ svn upgrade wc
>   svn: E155021: Working copy '…' is already at version 1.15 (format 32)
>     and cannot be downgraded to version 1.8 (format 31)

While the work on pristine checksum kinds is blocked by a veto, I took a
look at other things we'd probably want to handle before the release.

I think that the above case qualifies as such, so I made a related improvement
by introducing a new config option and also fixed the described error:

- r1907964 introduces a new `compatible-version` config setting, to allow
  configuring the desired default wc compatibility level globally.

- r1907965 fixes an error when `svn upgrade` is called without any arguments
  for a working copy of the newer format.
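
As an illustration only, one way a no-argument upgrade could avoid the E155021 error quoted above is to treat "already at or beyond the target format" as a no-op. The sketch below is an assumption about the general shape of such a check, not a description of what r1907965 does; the function name and return values are hypothetical, while the format numbers 31 and 32 are taken from the error message.

```python
OLDEST_DEFAULT_FORMAT = 31  # "version 1.8 (format 31)" in the error above
NEWEST_FORMAT = 32          # "version 1.15 (format 32)" in the error above


def plan_upgrade(current_format, target_format=OLDEST_DEFAULT_FORMAT):
    """Return (action, format) for an upgrade request; never plan a downgrade."""
    if current_format > NEWEST_FORMAT:
        raise ValueError(f"format {current_format} is newer than this client supports")
    if current_format >= target_format:
        # Already new enough: nothing to do, and no error either.
        return ("no-op", current_format)
    return ("upgrade", target_format)
```

For example, `plan_upgrade(32)` yields a no-op instead of an attempted downgrade to format 31.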

Thanks,
Evgeny Kotkov

Initial patch for storing the pristines-on-demand setting in the working copy (Issue 4889)

Posted by Evgeny Kotkov <ev...@visualsvn.com>.

Julian Foad <ju...@foad.me.uk> writes:

> Issue #4889 "per-WC config" is the subject of Johan's new dev@ post
> "Pristines-on-demand=enabled == format 32?". We already concurred that
> it's wise to decouple "pristines-on-demand mode is enabled in this WC"
> from "the WC format is (at least) 32 so can support that mode".
> <https://subversion.apache.org/issue/4889>. This may be considered
> higher priority than fixing the remaining tests.

I have been thinking about this recently, and here is a patch with the
first-cut implementation that persists the pristines-on-demand setting
in a working copy.  Unfortunately, I am getting a bit swamped with other
things to complete the work on it, but perhaps it could be useful as a
building block for the full implementation.

The patch currently allows doing an `svn checkout --store-pristines=no`,
which is going to create a working copy that doesn't store the pristine
copies of the files and fetches them on demand.  The setting is persisted
in wc.db.
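
To make "the setting is persisted in wc.db" concrete, here is a minimal Python sketch of storing a per-working-copy flag in an SQLite database. The table name, column names, and the 'store-pristine' key are all hypothetical; the actual wc.db schema used by the patch is not shown in this thread.

```python
import sqlite3


def open_wc_db(path=":memory:"):
    """Open a toy metadata database with a hypothetical key/value settings table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS settings "
        "(name TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return db


def set_store_pristine(db, enabled: bool):
    """Persist the choice made at checkout time."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO settings (name, value) "
            "VALUES ('store-pristine', ?)",
            ("yes" if enabled else "no",),
        )


def get_store_pristine(db) -> bool:
    """Read the setting back; default to storing pristines if it was never set."""
    row = db.execute(
        "SELECT value FROM settings WHERE name = 'store-pristine'"
    ).fetchone()
    return True if row is None else row[0] == "yes"
```

Keeping the flag separate from the format number is what lets a working copy be in the newer format without opting in to the feature.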

The patch doesn't include the following:

1) An update for the tests and the test suite to run the tests in both modes.

   Personally, I think that we should update the test runner so that it would
   execute the tests for both pristine modes by default, without requiring any
   specific switches.  Because otherwise, there is a chance that one of the
   equally supported core configurations may receive far less attention
   during the development and test runs.

2) An update for `svn info` to display the value of the new setting.

3) An ability to take the --store-pristines value from a user config, perhaps
   on a per-URL basis.
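
The suggestion in item 1), running every test in both pristine modes by default, could look roughly like the following. This is a self-contained Python sketch, not the Subversion test harness: the runner and both sample checks are invented for illustration.

```python
def run_in_both_modes(checks):
    """Run every check once per pristine mode and collect (name, mode, passed)."""
    results = []
    for check in checks:
        for store_pristine in (True, False):
            try:
                check(store_pristine)
                passed = True
            except AssertionError:
                passed = False
            results.append((check.__name__, store_pristine, passed))
    return results


def checkout_creates_wc(store_pristine):
    # Hypothetical check body; a real test would drive the svn client.
    assert store_pristine in (True, False)


def needs_local_pristine(store_pristine):
    # Deliberately mode-sensitive, to show a failure surfacing in one mode only.
    assert store_pristine, "this operation assumed a stored text-base"


if __name__ == "__main__":
    for name, mode, passed in run_in_both_modes(
        [checkout_creates_wc, needs_local_pristine]
    ):
        print(name, "store-pristine=yes" if mode else "store-pristine=no", passed)
```

Because no switch is needed to get the second mode, a regression that only affects pristine-less working copies is reported on every run.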

While working on the patch, I have stumbled across a couple of issues:

A) `svn upgrade` without arguments fails for a working copy with latest format

  $ svn checkout --compatible-version=1.15 wc
  $ svn upgrade wc
  svn: E155021: Working copy '…' is already at version 1.15 (format 32)
    and cannot be downgraded to version 1.8 (format 31)

  I haven't given it a lot of thought, but we might want to handle this case
  without an error or even think about making `svn upgrade` by default upgrade
  to the latest available version instead of the minimum supported (similar to
  `svnadmin upgrade`).

B) Shelving and pristines-on-demand

  It seems that both v2 and v3 shelving implementations are currently not
  updated to support pristines-on-demand working copies.

C) Bumping the related API

  This part originates from B).  For example, v3 shelving uses the libsvn_wc
  APIs, such as svn_wc_revert6().  If the working copy is created without the
  pristine contents, those calls are going to fail with an error saying that
  there is no text-base for the corresponding path.  This is a tricky error to
  understand, and the failure itself is unpredictable, because it depends on
  whether any of the previous API calls have fetched the missing text-bases.

  So if we think about v3 shelving as an example of the libsvn_wc API user,
  other existing third-party users of the API could face the same problem.

  Perhaps, we could bump the APIs that currently rely on the text-bases to
  always be available.  And we could then make their deprecated versions
  fail (predictably) for working copies that don't store pristine contents.
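
The idea in C) can be illustrated with a small Python model. It is a sketch under assumptions, not the libsvn_wc API: `revert_old` and `revert_new` are hypothetical stand-ins for a deprecated and a bumped function, and the `hydrate` flag is just one possible answer to how the new version might behave.

```python
class NoTextBaseError(Exception):
    """Raised when an operation needs a text-base that is not stored locally."""


class WorkingCopy:
    def __init__(self, store_pristine, fetched=()):
        self.store_pristine = store_pristine
        self.fetched = set(fetched)  # paths whose text-bases happen to be present

    def has_text_base(self, path):
        return self.store_pristine or path in self.fetched


def revert_old(wc, path):
    """Deprecated entry point: refuse up front on any pristine-less working copy."""
    if not wc.store_pristine:
        raise NoTextBaseError("working copy does not store pristines; use revert_new()")
    return f"reverted {path}"


def revert_new(wc, path, hydrate=True):
    """Bumped entry point: the caller says whether a missing text-base may be fetched."""
    if not wc.has_text_base(path):
        if not hydrate:
            raise NoTextBaseError(f"no text-base for {path}")
        wc.fetched.add(path)  # stand-in for fetching from the repository
    return f"reverted {path}"
```

The old function fails the same way whether or not an earlier call happened to fetch the text-base, which removes the unpredictability described above.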

Thanks,
Evgeny Kotkov

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

Julian Foad wrote:
> Pristines (#525):
>  - #4888 authz denied during textbase sync
>    (an edge case issue, not sure if it's a blocker)
>  - #4889 per-WC config
>    (wanted)
>  - #4891 fix disabled tests
>    (a few different edge cases; much of the analysis is posted in the issue)
> 
> Getting multi-wc-format ready for release (#4883):
>  - #4885 WC upgraded and not-upgraded notifications
>    (still open for some nice-to-haves, but probably done enough for MVP)
>  - #4886 config for default WC version for checkout & upgrade
>    ()
>  - #4887 clarify/unify option names for compatible-version
>    (perhaps change '--compatible-version' to '--wc-compatible-version'
> or '--min-compatible-client')
>  - API review; thread: "multi-wc-format review"
>    (state is APIs are mostly private and a bit messy; not clear what,
> if anything, we would want to change)

Further updates:

#4888: demoted to non-blocker
#4889: blocker, in progress
#4891: blocker, in progress (I've processed a bunch of the sub-issues in it)
#4885: done enough (now non-blocker)
#4886: not sure (currently marked non-blocker)
#4887: not sure (currently marked blocker)
API review: not sure
Merge to trunk: new thread "Pristines-on-demand: OK to merge to trunk?"

In #4891 "fix disabled tests", the remaining sub-issues don't look like
show-stoppers. Likely we will soon demote it to non-blocker.

Issue #4889 "per-WC config" is the subject of Johan's new dev@ post
"Pristines-on-demand=enabled == format 32?". We already concurred that
it's wise to decouple "pristines-on-demand mode is enabled in this WC"
from "the WC format is (at least) 32 so can support that mode".
<https://subversion.apache.org/issue/4889>. This may be considered
higher priority than fixing the remaining tests. I previously drafted a
proof-of-concept for such a config setting. I'm going to spend two or
three hours and see if I can complete an acceptable minimal version of it.

This (#4889) conceptually also relates to #4886 "config for default WC
version for checkout & upgrade"; I am not yet sure if both are
separately necessary.

Two other issues Karl and I discussed were:

* regression tests:
  -> current status is devs need to run test suite both with and without
the new '--wc-format-version=1.15' knob;
  -> this adds another knob to the several existing knobs;
  -> the resulting exponential increase in test runs is a concern but
not a new problem in itself;
  -> we should make build bots run that combination.
  -> Filed as #4898 "Pristines-on-demand: make buildbots test it"

* simplified user documentation:
  -> not sure, maybe existing is sufficient initially (just needs to be
put where users can find it?);
  -> maybe someone else will be able to rewrite into a simpler, more
digestible form?
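
The growth in test runs mentioned under "regression tests" above is easy to see by enumerating knob combinations. The Python sketch below is illustrative only: apart from the WC format knob discussed in this thread, the knob names and values are assumptions and not a list of the test suite's real options.

```python
from itertools import product

# Knob names and values are illustrative; only the WC format knob
# comes from this thread.
KNOBS = {
    "fs-type": ["fsfs", "fsx"],
    "ra-layer": ["local", "svn", "http"],
    "wc-format-version": ["1.8", "1.15"],
}


def all_configurations(knobs):
    """Return one dict per combination of knob values."""
    names = list(knobs)
    return [
        dict(zip(names, values))
        for values in product(*(knobs[name] for name in names))
    ]


if __name__ == "__main__":
    configs = all_configurations(KNOBS)
    print(len(configs), "full test runs")
```

Each additional two-valued knob doubles the number of full runs, which is why delegating the combinations to build bots is attractive.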

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Nathan Hartman <ha...@gmail.com>.

On Sat, Nov 5, 2022 at 6:13 PM Karl Fogel <kf...@red-bean.com> wrote:
>
> Hi, all.  This is a high-level mail in which I try to figure out
> the current status of the issue #525 work and what's left to land
> it in trunk and release it.  Corrections and feedback welcome.

Thanks for the overview and the work already done to make this
possible!

The P-O-D feature itself works.

What's left to do for a first release, IMHO:

(1) Decide on user-facing names for the feature and its command line
switch(es).

(2) Resolve the [TODO] that Karl mentions (decoupling the compatible
version switch from the i525pod switch).

Though there are many other possible enhancements, some of them touched
upon in Karl's message, I think these two items are the only really
crucial ones for a first release.

I have much to say on both of these but I won't go into detail yet
because that would hijack the thread away from the high-level topic of:
what remains to be done for initial viable product? I'd like to give
others a chance to respond before we dive down the rabbit hole. :-)

It's better if each of the above becomes a thread devoted to that
topic.

I'll point out that some initial release note text was drafted at [1].

Cheers,
Nathan

[1] https://subversion-staging.apache.org/docs/release-notes/1.15.html#bare-working-copies

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Nathan Hartman wrote on Wed, Dec 07, 2022 at 20:29:11 -0500:
> On Wed, Dec 7, 2022 at 12:11 PM Evgeny Kotkov via dev <
> dev@subversion.apache.org> wrote:
> 
> >
> > I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> > merge to trunk.  I could do that, assuming there are no objections.
> 
> 
> 
> I'd like to echo what others have already said by saying a great big THANK
> YOU, to all who have worked on this cool new feature so far!
> 
> I used an earlier incarnation of this branch some months ago in real usage
> scenarios with good results and looking at the recent commit emails as
> they've happened everything looks sensible to me.
> 
> I will try to run the full test suite in the next couple of days and
> assuming the tests pass for me I'll use it as my daily driver to test the
> real usage. Obviously I'll post here if I find anything...
> 
> Meanwhile I'd like to say that on further thought and after reading Johan's
> and Karl's feedback regarding the feature switch naming, I've come around
> to the point of view that --store-pristine={yes|no} is a perfectly fine UI.
> 

Well, if we're bikeshedding anyway, how about --backend-tweaks=without-pristines?
We can support just two values for starters ("without pristines" and
"with pristines"), and have the room to extend this in 1.16, similar to
--trust-server-cert/--trust-server-cert-failures and
--pre-1.4-compatible/--compatible-version.

Similarly, a new config file section with one valid option might make
sense if we anticipate adding more options to that section in the
future.  This way we avoid having the configuration split across two
places.

> Given that this is now the command line switch name, and since users are
> given direct control over the pristinefulness of a WC, and we've been
> calling this feature Pristines On Demand since its inception, I think we
> should finally bless this as the official name of the feature.
> 
> In the next couple of days I plan to update the staged 1.15 release notes,
> which until now tentatively called it Bare Working Copies, to call it
> Pristines On Demand and to complete the description there.
> 
> Regarding the SHA hash question:
> 
> > While here, I would like to raise a topic of incorporating a switch from
> > SHA1 to a different checksum type (without known collisions) for the new
> > working copy format.  This topic is relevant to the pristines-on-demand
> > branch, because the new "is the file modified?" check relies on the
> > checksum comparison, instead of comparing the contents of working and
> > pristine files.
> >
> > And so while I consider it to be out of the scope of the
> > pristines-on-demand branch, I think that we might want to evaluate if
> > this is something that should be a part of the next release.
> 
> 
> Is it feasible and would it be beneficial to somehow decouple the hash code
> type from the wc format version? Asking because IIRC the need for a format
> bump to change hashes was one of the reasons it wasn't done a few years ago.

Maybe if we teach f32 to read /two/ new checksum kinds?  E.g., if we
teach f32 to read both SHA-512 and SHA-3, then even if 1.15 f32 writes
SHA-512 by default, it will nevertheless be able to read f32 wc's with
SHA-3 rows that 1.16 might create.
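
A rough Python sketch of the "read two kinds, write one" idea follows. Everything specific in it is an assumption made for illustration: the '$kind$hexdigest' text format, the choice of SHA3-256 as the SHA-3 variant, and the function names. It says nothing about how wc.db actually stores checksums.

```python
import hashlib

# Kinds this hypothetical reader understands, regardless of which one it writes.
READABLE_KINDS = {"sha512": hashlib.sha512, "sha3_256": hashlib.sha3_256}
DEFAULT_WRITE_KIND = "sha512"


def write_checksum(data: bytes, kind: str = DEFAULT_WRITE_KIND) -> str:
    """Serialize a checksum together with its kind."""
    return f"${kind}${READABLE_KINDS[kind](data).hexdigest()}"


def verify_checksum(data: bytes, recorded: str) -> bool:
    """Check `data` against a recorded checksum of any readable kind."""
    _, kind, digest = recorded.split("$", 2)
    if kind not in READABLE_KINDS:
        raise ValueError(f"unsupported checksum kind: {kind}")
    return READABLE_KINDS[kind](data).hexdigest() == digest
```

A client that writes only the default kind can still verify rows written with the other kind, which is the forward-compatibility property described above.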

svn_checksum_kind_t's possible values include svn_checksum_fnv1a_32, so
I guess we already support reading wc.db's that use FNV-1a checksums?
(Incidentally, f31 is new in 1.8 whereas svn_checksum_fnv1a_32 is new
in 1.9.)

Cheers,

Daniel

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Nathan Hartman <ha...@gmail.com>.

On Wed, Dec 7, 2022 at 12:11 PM Evgeny Kotkov via dev <
dev@subversion.apache.org> wrote:

>
> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.

I'd like to echo what others have already said by saying a great big THANK
YOU, to all who have worked on this cool new feature so far!

I used an earlier incarnation of this branch some months ago in real usage
scenarios with good results and looking at the recent commit emails as
they've happened everything looks sensible to me.

I will try to run the full test suite in the next couple of days and
assuming the tests pass for me I'll use it as my daily driver to test the
real usage. Obviously I'll post here if I find anything...

Meanwhile I'd like to say that on further thought and after reading Johan's
and Karl's feedback regarding the feature switch naming, I've come around
to the point of view that --store-pristine={yes|no} is a perfectly fine UI.

Given that this is now the command line switch name, and since users are
given direct control over the pristinefulness of a WC, and we've been
calling this feature Pristines On Demand since its inception, I think we
should finally bless this as the official name of the feature.

In the next couple of days I plan to update the staged 1.15 release notes,
which until now tentatively called it Bare Working Copies, to call it
Pristines On Demand and to complete the description there.

Regarding the SHA hash question:

> While here, I would like to raise a topic of incorporating a switch from
> SHA1 to a different checksum type (without known collisions) for the new
> working copy format.  This topic is relevant to the pristines-on-demand
> branch, because the new "is the file modified?" check relies on the
> checksum comparison, instead of comparing the contents of working and
> pristine files.
>
> And so while I consider it to be out of the scope of the
> pristines-on-demand branch, I think that we might want to evaluate if
> this is something that should be a part of the next release.

Is it feasible and would it be beneficial to somehow decouple the hash code
type from the wc format version? Asking because IIRC the need for a format
bump to change hashes was one of the reasons it wasn't done a few years ago.

Cheers,
Nathan

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 13 Dec 2022, Evgeny Kotkov wrote:
>Evgeny Kotkov <ev...@visualsvn.com> writes:
>Merged in https://svn.apache.org/r1905955

W00t!!  Thank you, and Julian and Daniel and everyone who's 
contributed to this.

So... do we have a release manager?  :-)

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> Merged in https://svn.apache.org/r1905955
>
> I'm going to respond on the topic of SHA1 a bit later.

For the history: thread [1] proposes the `pristine-checksum-salt` branch that
adds the infrastructure to support new pristine checksum kinds in the working
copy and makes a switch to the dynamically-salted SHA1.

From the technical standpoint, I think that it would be better to release
the first version of the pristines-on-demand feature having this branch
merged, because now we rely on the checksum comparison to determine if a
file has changed — and currently it's a checksum kind with known collisions.
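
To show why the checksum kind matters here, this is a simplified Python sketch of a checksum-based "is the file modified?" check. It is an illustration under assumptions, not Subversion's implementation: the function names are invented, and it ignores details such as keyword and end-of-line translation.

```python
import hashlib
from pathlib import Path


def file_sha1(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path: Path, recorded_sha1: str) -> bool:
    """Without a local pristine, the only evidence of a change is a differing checksum."""
    return file_sha1(path) != recorded_sha1
```

If two different contents share a checksum, this check reports "unmodified" and there is no local pristine to compare against, hence the interest in a checksum kind without known collisions.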

At the same time, having that branch merged probably isn't a formal release
blocker for the pristines-on-demand feature.  Also, considering that the
`pristine-checksum-salt` branch is currently vetoed by danielsh (presumably,
for an indefinite period of time), I'd like to note that personally I have
no objections to proceeding with a release of the pristines-on-demand
feature without this branch.

[1] https://lists.apache.org/thread/xmd7x6bx2mrrbw7k5jr1tdmhhrlr9ljc

Regards,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.

Merged in https://svn.apache.org/r1905955

I'm going to respond on the topic of SHA1 a bit later.

Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Daniel Sahlberg <da...@gmail.com>.

Evgeny,

Thanks so much for your hard work in pushing this project forward!

I don't think I can contribute much in getting this merged to trunk (from
lack of C experience and lack of time to dig into the inner workings), but
I hope it can be completed!

Kind regards,
Daniel Sahlberg


On Wed, 7 Dec 2022 at 18:10, Evgeny Kotkov via dev <
dev@subversion.apache.org> wrote:

> Evgeny Kotkov <ev...@visualsvn.com> writes:
>
> > > IMHO, once the tests are ready, we could merge it and release
> > > it to the world.
> >
> > Apart from the required test changes, there are some technical
> > TODOs that remain from the initial patch and should be resolved.
> > I'll try to handle them as well.
>
> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.
>
>
> https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf
>
> The branch includes the following:
> – Core implementation of the new mode where required pristines are fetched
>   at the beginning of the operation.
> – A new --store-pristine=yes/no option for `svn checkout` that is persisted
>   as a working copy setting.
> – An update for `svn info` to display the value of this new setting.
> – A standalone test harness that tests main operations in both
>   --store-pristine modes and gets executed on every test run.
> – A new --store-pristine=yes/no option for the test suite that forces all
>   tests to run with a specific pristine mode.
>
> The branch passes all tests in my Windows and Linux environments, in both
> --store-pristine=yes and =no modes.
>
>
> While here, I would like to raise a topic of incorporating a switch from
> SHA1 to a different checksum type (without known collisions) for the new
> working copy format.  This topic is relevant to the pristines-on-demand
> branch, because the new "is the file modified?" check relies on the
> checksum comparison, instead of comparing the contents of working and
> pristine files.
>
> And so while I consider it to be out of the scope of the
> pristines-on-demand branch, I think that we might want to evaluate if this
> is something that should be a part of the next release.
>
>
> Thanks,
> Evgeny Kotkov
>

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 07 Dec 2022, Evgeny Kotkov wrote:
>The branch passes all tests in my Windows and Linux environments,
>in both --store-pristine=yes and =no modes.

FYI, it passes all tests here too (on Debian GNU/Linux, up-to-date 
'testing' distro).  Attached file has details; there were some 
XFAILs, but no FAILs.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

Posted by Branko Čibej <br...@apache.org>.

On 20.12.2022 09:14, Evgeny Kotkov wrote:
> 2) We already need a working copy format bump for the pristines-on-demand
>     feature.  So using that format bump to solve the SHA1 issue might reduce
>     the overall number of required bumps for users (assuming that we'll still
>     need to switch from SHA1 at some point later).

Using a new hashing algorithm in the working copy is relatively simple. 
Making such a change backwards-compatible is not. It would be really 
nice if this could be done in a way that allows newer clients to still 
support older working copies without upgrading them; after all, we have 
the infrastructure for this in place now.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Tue, Dec 20, 2022 at 11:14:00 +0300:
> [Moving discussion to a new thread]
> 
> We currently have a problem that a working copy relies on the checksum type
> with known collisions (SHA1).  A solution to that problem

Why is libsvn_wc's use of SHA-1 a problem?  What's the scenario wherein
Subversion will behave differently than it should?

> is to switch to a different checksum type without known collisions in
> one of the newer working copy formats.

Such as SHA-1 salted by NODES.LOCAL_RELPATH and NODES.WC_ID (or a per-wc UUID)?

> Since we plan on shipping a new working copy format in 1.15, this seems to
> be an appropriate moment of time to decide whether we'd also want to switch
> to a checksum type without known collisions in that new format.
> 

What's the acceptance test we use for candidate checksum algorithms?

You say we should switch to a checksum algorithm that doesn't have known
collisions, but, why should we require that?  Consider the following
160-bit checksum algorithm:
.
    1. If the input consists of 40 ASCII lowercase hex digits and
       nothing else, return the input.
    2. Else, return the SHA-1 of the input.

This algorithm has a trivial first preimage attack.  If a wc used this
identity-then-sha1 algorithm instead of SHA-1, then… what?
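
The identity-then-sha1 algorithm above can be written down directly. The following Python sketch is only an illustration of the thought experiment; the function names are made up here and nothing like this exists in Subversion.

```python
import hashlib
import re

# Exactly 40 lowercase hex digits and nothing else.
_HEX40 = re.compile(rb"[0-9a-f]{40}")


def identity_then_sha1(data: bytes) -> str:
    """Step 1: return the input if it is 40 lowercase hex digits.
    Step 2: otherwise return the SHA-1 of the input."""
    if _HEX40.fullmatch(data):
        return data.decode("ascii")
    return hashlib.sha1(data).hexdigest()


def first_preimage(digest: str) -> bytes:
    """Trivial first preimage: the digest's own ASCII text maps to itself."""
    return digest.encode("ascii")


original = b"some versioned file\n"
target = identity_then_sha1(original)
forged = first_preimage(target)

# Two different inputs, one checksum, found with no computation at all.
assert forged != original
assert identity_then_sha1(forged) == target
```

The forged input is always a 40-byte hex string, so the open question in the message stands: whether an attacker can do anything useful with that in a working copy.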

> Below are the arguments for including a switch to a different checksum type
> in the working copy format for 1.15:
> 
> 1) Since the "is the file modified?" check now compares checksums, leaving
>    everything as-is may be considered a regression, because it would
>    introduce additional cases where a working copy currently relies on
>    comparing checksums with known collisions.
> 

Well, SHA-1 is still collision-free so long as one is not deliberately
trying to use collisions, so this would only be a regression if we
consider "Deliberately store files that have the same checksum" to be
a use-case.  Do we?

I recall we discussed this when shattered.io was announced, and we
didn't rush to upgrade the checksums we use everywhere, so I guess back
then we came to the conclusion that wasn't a use-case.  (Of course we
can change our opinion; that's just a datapoint, and there may be more,
on both sides, in the old thread.)

I looked for the old thread and didn't find it.  (I looked in the
private@ archives too in case the thread was there.)

> 2) We already need a working copy format bump for the pristines-on-demand
>    feature.  So using that format bump to solve the SHA1 issue might reduce
>    the overall number of required bumps for users (assuming that we'll still
>    need to switch from SHA1 at some point later).
> 

Considering that 1.15 will support reading and writing both f31 and f32,
the "overall number of required bumps" between 1.8 and trunk@HEAD is
zero, meaning the proposed change can't reduce that number.

> 3) While the pristines-on-demand feature is not released, upgrading
>    with a switch to the new checksum type seems to be possible without
>    requiring a network fetch.

I infer the scenario in question here is upgrading a (say) pristineless
wc to a newer format that supports a new checksum algorithm.

>    But if some of the pristines are optional, we lose the possibility
>    to rehash all contents in place.  So we might find ourselves having
>    to choose between two worse alternatives of either requiring
>    a network fetch during upgrade or entirely prohibiting an upgrade
>    of working copies with optional pristines.

Why would we want to rehash everything in place?  The 1.15→1.16 upgrade
could simply leave pristineless files' checksums as SHA-1 until the next
«svn up», just like «svnadmin upgrade» of FSFS doesn't retroactively add
SHA-1 checksums to node-rev headers or "-file" or "-dir" indicators in
the changed-paths section.
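
A rough sketch of that "leave it until the next update" idea, in Python. Everything here is an assumption made for illustration: the record layout, the `upgrade` and `matches` helpers, and the choice of SHA-256 as the new kind are hypothetical and do not describe the wc.db schema or any branch.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class PristineRecord:
    kind: str                 # hashlib algorithm name, e.g. "sha1" or "sha256"
    digest: str               # hex digest computed with that algorithm
    content: Optional[bytes]  # None when the pristine text is not stored locally


def upgrade(record: PristineRecord, new_kind: str = "sha256") -> PristineRecord:
    """Rehash only when the text is available; otherwise keep the old kind."""
    if record.content is None:
        return record  # no network fetch: stays as-is until content arrives
    digest = hashlib.new(new_kind, record.content).hexdigest()
    return PristineRecord(new_kind, digest, record.content)


def matches(record: PristineRecord, data: bytes) -> bool:
    """Compare using whichever checksum kind the record was stored with."""
    return hashlib.new(record.kind, data).hexdigest() == record.digest


text = b"hello\n"
present = PristineRecord("sha1", hashlib.sha1(text).hexdigest(), text)
absent = PristineRecord("sha1", hashlib.sha1(text).hexdigest(), None)

print(upgrade(present).kind)  # rehashed in place
print(upgrade(absent).kind)   # left alone until the text is fetched
```

The cost of this approach is that every comparison has to consult the per-record kind, so old and new checksums coexist in one working copy for an open-ended period.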

There may be yet other alternatives.

> Thoughts?

I'm not voting either -0 or +0 at this time.

Cheers,

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Fri, Jan 20, 2023 at 11:18:56 -0600:
> On 20 Jan 2023, Nathan Hartman wrote:
> > Taking a step back, this discussion started because pristine-free WCs
> > are IIUC more dependent on comparing hashes than pristineful WCs, and
> > therefore a hash collision could have more impact in a pristine-free
> > WC. "Guarantees" were mentioned, but I think it's important to state
> > that there's only a guarantee of probability, since as mentioned above
> > all hashes will have collisions.
> 
> Sure, in a literal mathematical sense, but not in a sense that matters for
> our purposes here.
> 
> In the absence of an intentionally caused collision, a good hash function
> has *far* less chance of accidental collision than, say, the chance that
> your CPU will malfunction due to a stray cosmic ray, or the chance of us
> getting hit by a planet-destroying meteorite tomorrow.
> 
> For our purposes, "guarantee" is accurate.  No guarantee we make can be
> stronger than the inverse probability of a CPU/memory malfunction anyway.
> 

The probability of an accidental collision in a "good" N-bit hash
function is on the order of 1/√(2^N), which for sufficiently large N is
considered an acceptable risk.  That probability is invariant over time;
intentionally causing collisions, however, becomes easier over time.

> > We already can't store files with identical SHA1 hashes, but AFAIK the
> > only meaningful impact we've ever heard is that security researchers
> > cannot track files they generate with deliberate collisions. The same
> > would be true with any hash type, for collisions within that hash
> > type.
> 
> Yes.  A hash is considered "broken" the moment security researchers can
> generate a collision.
> 

To be clear, is this what you're saying? —
.
    Premise: There is a collision attack against SHA-1.
    Conclusion: Subversion should stop using SHA-1.

This conclusion does not follow from this premise.  For instance, FSFS
checks for collisions, so it can actually use "File length in bytes" as
a checksum and everything would work; the only thing that would change
is that it would not be possible to commit a file that's the same
expanded_size as any other node-rev (including directories).
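
The argument can be illustrated with a toy store that verifies content whenever a key is already taken. This Python sketch is not FSFS's rep-sharing code; the `CheckedStore` class is hypothetical and only shows why a collision-checking store stays correct even with a checksum as weak as the byte length.

```python
class CheckedStore:
    """Toy content store keyed by an arbitrary checksum function.

    Correctness does not depend on the checksum being collision-free,
    because every key hit is verified against the stored content.
    """

    def __init__(self, checksum):
        self._checksum = checksum
        self._blobs = {}

    def put(self, data: bytes):
        key = self._checksum(data)
        existing = self._blobs.get(key)
        if existing is None:
            self._blobs[key] = data
        elif existing != data:
            # Same key, different content: refuse rather than corrupt.
            raise ValueError(f"checksum collision on key {key!r}")
        return key

    def get(self, key) -> bytes:
        return self._blobs[key]


store = CheckedStore(len)       # "file length in bytes" as the checksum
key = store.put(b"hello")
store.put(b"hello")             # identical content is shared, not rejected
try:
    store.put(b"world")         # same length, different content
    rejected = False
except ValueError:
    rejected = True
print(key, rejected)
```

With `len` as the checksum the store is nearly useless in practice, since any two files of equal length conflict, but it never hands back the wrong content. That is the distinction being drawn: a weak checksum costs availability here, not integrity.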

And, anyway, the burden is not on me to disprove your claim, but on
you to prove it.

> FWIW, in one of my previous posts, I described a real-life scenario in which
> the ability to generate a chosen-plaintext collision in an SVN working copy
> would have security implications.

Yes, and as I have already asked: What other counters to that attack,
besides migrating away from SHA-1, have you considered?  Have you
considered the downsides of migrating away from SHA-1?

Also, /if/ we changed checksums, would that address the attack?  Put
differently, why is a similar attack impossible if we change the
checksum algorithm?  Why is use of SHA-1 a /sine qua non/ of your
scenario?

For example, if we used another checksum algorithm, the attacker from
your scenario might opt to edit the base checksums in .svn/wc.db and
rename the .svn/pristine/ files accordingly.  That's much easier to pull
off, and will be easy to adapt if we change the algorithm again, but on
the other hand, requires write access to the .svn directory and is
easier to discover.

Daniel

> Best regards,
> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, Jan 30, 2023 at 17:26:03 -0600:
> On 29 Jan 2023, Evgeny Kotkov via dev wrote:
> > I have *absolutely* no idea where "being railroaded through" comes
> > from.  Really, it's a wrong way of portraying and thinking about the
> > events that have happened so far.
> > 
> > Reiterating over those events: I wrote an email containing my
> > thoughts and explaining the motivation for such change.  I didn't
> > reply to some of the questions (including some tricky questions,
> > such as the one featuring a theoretical hash function), because they
> > have been at least partly answered by others in the thread, and I
> > didn't have anything valuable to add at that time.
> > 
> > During that time, I was actively coding the core part of the change,
> > to check if it's possible technically.  Which is important, as far
> > as I believe, because not all theoretically possible solutions can
> > be implemented without facing significant practical or
> > implementation-related issues, and it seems to me that you
> > significantly undervalue such an approach.
> > 
> > I do not say my actions were exemplary, but as far as I can tell,
> > they're pretty much in line with how svn-dev has been operating so
> > far. But, it all resulted in an unclear veto without any _technical_
> > arguments, where what's being vetoed is unclear as well, because the
> > change was not ready at the moment the veto was cast.
> > 
> > And because your veto goes in favor of a specific process
> > (considering that no other arguments were given), the only thing
> > that's *actually* being railroaded is an odd form of an RTC
> > (review-then-commit) process that is against our usual CTR
> > (commit-then-review) [1,2].  That's railroading, because it hasn't
> > been explicitly discussed anywhere and a consensus on it has not
> > been reached.
> 
> Daniel, given what's in Evgeny's branch now, could you summarize your
> current technical objections if any?
> 
> If they are something like "This code is solving the wrong problem(s)" or
> "I'm not sure what problem(s) it's supposed to solve", those count as
> technical objections.  It's just that it would be useful to have the
> objection(s) gathered in one place. This thread has been long and somewhat
> digressive -- I'm not saying that's due to you -- and I at least have found
> it a bit difficult to keep track of the concrete objections versus various
> interesting but ultimately theoretical points.
> 

Quoting my other reply just now:

    […] it's pretty simple.  [The OP] said "We should do Y because it
    addresses X".  [The OP] didn't explain why X needs to be addressed, didn't
    consider what alternatives there are to Y, didn't consider any cons that
    Y may have… and when people had questions, [the OP] just began to
    implement Y, without responding to or even acknowledging those
    questions.
    
    That's not how design discussions work.  A design discussion doesn't go
    "state decision; state pros; implement"; it goes "state problem; discuss
    potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).
    
    That's why I called veto: not because I considered any particular
    proposal then on the table unreasonable, but because I considered /the
    decision process being used/ unreasonable (cf. [7]).

Concretely: Why would migrating away from SHA-1 be a good thing in the
first place?  Assuming that it /would/ be a good thing, what alternative
ways are there to achieve whatever the goodness may be (new feature /
bugfix / resilience to some attack vector / etc.)?  What are the
potential *downsides* of migrating away from SHA-1?

The same, restated at a higher level of abstraction: "Migrate
away from SHA-1" is a means, not an end.  Define the ends and have
a non-predetermined-outcome discussion on how to achieve them.

"Reduce the security impact to our users of second-preimage attacks
against SHA-1" would be an end.  I don't know whether it's the only one
or whether there are additional ones.

[As to the branch, I'm not sure whether to restate my position on it or
not — so I'll restate it, erring on the side of including too much
rather than too little, but feel free to ignore the following paragraph
at will:]

Was the branch commenced as a PoC / smoke test, to explore one proposed
direction and to be discarded if the consensus compass should end up
pointing towards another cardinal direction?  Or was it commenced on the
assumption that consensus on migrating from SHA-1 to SHA-256 went without
saying, had already formed, or would necessarily have formed by 1.15.0-rc1?

> The reason I'm supportive of Evgeny's direction is that his changes, if
> completed, would offer a solution to the (admittedly still somewhat distant)
> security concern I raised early on. Essentially, I'm worried that
> second-preimage attacks on SHA-1 are coming eventually (maybe I'm wrong
> about this -- they are after all significantly harder than mere collision
> attacks).  *If* such attacks become possible, then our WC could report a
> file as unmodified when in fact it is modified, which would have real
> security implications, as I outlined.
> 

I take it you're referring to this:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C87k02dr4mn.fsf%40red-bean.com%3E
    I have put WordPress installations under Subversion version control before.
    Once, I detected an attack on one of those WordPress servers when one of the
    things the attacker did was modify some of the WordPress scripts on the
    server.  Those files showed up as modified when I ran 'svn st', and from
    there I ran 'svn diff' and figured out what had happened.  But a
    super-careful attacker could make modifications that leave the
    version-controlled files with the same SHA1 hash they had before, thus
    making it harder to detect the attack.
    
    Yes, I realize there are other ways to detect modifications, and that random
    attackers are unlikely to take the trouble to preserve hashes.  On the other
    hand, a well-resourced spear-phishing attacker who knows something about the
    usage of SVN at their target might indeed try a hash-preserving approach to
    breaking in. The point is, if we're counting on the hashes having certain
    semantics, then our users are counting on it too.  If SHA1 no longer has
    those semantics, we should upgrade.

I offered one alternative counter to that here:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3Cadacbb6f-e0cb-4e5b-8603-0eda19f93b3c%40app.fastmail.com%3E
    So, suppose the wc didn't hardcode _any particular_ hash function for
    naming pristines and for status walks — not md5, not sha1, not sha256 —
    but had each «svn checkout» run pick a hash function uniformly at random
    out of a large enough family of hash functions[1].  (Intuitively, think
    of a family of hash functions as a hash function with a random salt,
    similar to [2].)
    
    This way, even if someone tried to deliberately create a collision, they
    wouldn't be able to pick a collision "off the shelf", as with
    shattered.io; they'd need to compute a collision for the specific hash
    function ("salt") used by that particular wc.  That's more difficult than
    creating a collision in a well-known hash function, regardless of
    whether we treat the salt's value as a secret of the wc (as in, stored
    in a mode-0400 file under the .svn directory and not disclosed to the
    server) or as a value the attacker is assumed to know.
    
    So, that's one way to address [the WordPress scenario].
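
A minimal Python sketch of the per-working-copy salt idea quoted above. The helper names, the 32-byte salt length, and the choice to prepend the salt are assumptions made for illustration; they are not taken from the `pristine-checksum-salt` branch.

```python
import hashlib
import os


def new_wc_salt(nbytes: int = 32) -> bytes:
    """Pick a random salt once, e.g. at checkout time (hypothetical helper)."""
    return os.urandom(nbytes)


def salted_sha1(salt: bytes, data: bytes) -> str:
    """SHA-1 over salt followed by data.

    The salt goes first on purpose: it changes the hash's internal state
    before any file content is processed.
    """
    h = hashlib.sha1()
    h.update(salt)
    h.update(data)
    return h.hexdigest()


content = b"the same file content\n"
wc_a = new_wc_salt()
wc_b = new_wc_salt()

# The same content gets unrelated names in two different working copies.
print(salted_sha1(wc_a, content))
print(salted_sha1(wc_b, content))
```

The order matters. SHA-1 processes its input block by block, so two equal-length colliding messages still collide when the same suffix is appended to both; a salt placed after the data would not help against a published collision pair, while a salt placed before it changes the starting state that the collision was computed for.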

And analysed the marginal attack difficulty if we change the checksum
algorithm here:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121102455.GB3174%40tarpaulin.shahaf.local2%3E
    For example, if we used another checksum algorithm, the attacker from
    your scenario might opt to edit the base checksums in .svn/wc.db and
    rename the .svn/pristine/ files accordingly.  That's much easier to pull
    off, and will be easy to adapt if we change the algorithm again, but on
    the other hand, requires write access to the .svn directory and is
    easier to discover.

In any case, even assuming second-preimage attacks against SHA-1 are
something we should assume adversaries capable of [and I'm not
expressing any opinion on this question], it does not /automatically/
follow that we should migrate away from SHA-1:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121102455.GB3174%40tarpaulin.shahaf.local2%3E
    To be clear, is this what you're saying? —
    .
        Premise: There is a collision attack against SHA-1.
        Conclusion: Subversion should stop using SHA-1.

    This conclusion does not follow from this premise.  For instance, FSFS
    checks for collisions, so it can actually use "File length in bytes" as
    a checksum […]

And to be clear: I'm not saying Subversion should continue using SHA-1,
and I'm not saying that Subversion should stop using SHA-1.  I'm saying
we should consider what the alternatives to that are.

> Like I said, this is far from urgent, and IMHO it certainly should not delay
> a release of our new pristineless feature.  But when and if Evgeny's branch
> is ready (where "ready" presumably includes something other than salted
> SHA-1 as the other checksum option), I would like to see these changes go
> in, unless we identify some harm from them.
> 
> For everyone's ease of reference:
> 
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/BRANCH-README
> 
> $ svn log --stop-on-copy
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/
> 
> Best regards,
> -Karl

Thanks for allowing me the time to write a proper response :)

Daniel

Glossary of attacks (was: Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Definitions of attacks:

1. Collision attack:
   Given h(),
   find x₁, x₂ such that h(x₁) == h(x₂).

2. Second preimage attack:
   Given h() and x,
   find x′ such that h(x) == h(x′).

3. First preimage attack:
   Given h() and y,
   find x such that h(x) == y.

4. Chosen prefix attack:
   Given h(), p₁, and p₂,
   find m₁, m₂ such that h(m₁) == h(m₂) and m₁.startswith(p₁) and m₂.startswith(p₂).
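
To make the difference between definitions 1 and 2 tangible, here is a small Python experiment on a deliberately weak toy hash: the first 16 bits of SHA-1. The toy hash and the function names are assumptions for illustration only; nothing here attacks real SHA-1.

```python
import hashlib
from itertools import count


def h16(data: bytes) -> bytes:
    """Toy 16-bit hash: the first two bytes of SHA-1."""
    return hashlib.sha1(data).digest()[:2]


def find_collision():
    """Collision attack: find any x1 != x2 with h(x1) == h(x2).

    The attacker chooses both inputs, so a table of digests seen so far
    finds a match quickly (birthday effect).
    """
    seen = {}
    for i in count():
        x = b"msg-%d" % i
        d = h16(x)
        if d in seen:
            return seen[d], x, i + 1
        seen[d] = x


def find_second_preimage(x: bytes):
    """Second preimage attack: x is fixed; find x' != x with h(x') == h(x)."""
    target = h16(x)
    for i in count():
        candidate = b"alt-%d" % i
        if candidate != x and h16(candidate) == target:
            return candidate, i + 1


x1, x2, collision_tries = find_collision()
x_prime, preimage_tries = find_second_preimage(b"file A")
print("collision after", collision_tries, "tries")
print("second preimage after", preimage_tries, "tries")
```

For an ideal N-bit hash, a brute-force collision search needs on the order of 2^(N/2) attempts and a second preimage search on the order of 2^N, which is why a practical collision attack on a hash function does not by itself imply a practical second preimage attack.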

Daniel Shahaf wrote on Thu, Jan 26, 2023 at 09:33:59 +0000:
> Evgeny Kotkov via dev wrote on Mon, Jan 23, 2023 at 02:28:50 +0300:
> > However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
> > probably only a matter of time until the situation becomes worse.
> > 
> 
> Quoting the third hunk of 
> <https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C20221220201300.GH32332%40tarpaulin.shahaf.local2%3E>:
> 
>     What's the acceptance test we use for candidate checksum algorithms?
>     
>     You say we should switch to a checksum algorithm that doesn't have known
>     collisions, but, why should we require that?  Consider the following
>     160-bit checksum algorithm:
>     .
>         1. If the input consists of 40 ASCII lowercase hex digits and
>            nothing else, return the input.
>         2. Else, return the SHA-1 of the input.
>     
>     This algorithm has a trivial first preimage attack.  If a wc used this
>     identity-then-sha1 algorithm instead of SHA-1, then… what?
> 
> > That could happen after a public disclosure of a pair of executable
> > files/scripts where the forged version allows for remote code execution.
> > Or maybe something similar with a file format that is often stored in
> > repositories and that can be executed or used by a build script, etc.
> > 
> 
> Err, hang on.  Your reference described a chosen-prefix attack, while
> this scenario concerns a single public collision.  These are two
> different things.
> 
> Disclosure of a pair of executable files/scripts isn't by itself
> a problem unless one of the pair ("file A") is in a repository
> somewhere.  Now, was the colliding file ("file B") generated _before_ or
> _after_ file A was committed?
> 
> - If _before_, then it would seem Mallory had somehow managed to:
> 
>   1. get a file of his choosing committed to Alice's repository; and
> 
>   2. get a wc of Alice's repository into one of the codepaths that
>      assume SHA-1 is one-to-one / collision-free (currently that's the
>      ra_serf optimization and the 1.15 wc status).
> 
>   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
>   me how file B would reach the wc in step #2… but insofar as security
>   assumptions go, it seems reasonable to assume Mallory can make this
>   happen.
> 
>   So, I agree it's a scenario we should address.  What options do we
>   have to address it?  (I grant that migrating away from SHA-1 is one
>   option.)
> 
> - If _after_, then you're presuming not simply a collision attack but
>   a second preimage attack.  Should we assume Mallory to be able to
>   mount a second preimage attack?
> 
> Chosen-prefix collision attacks can help Mallory in a variant of the
> "before" case: Mallory computes a collision, sends file A to Alice (who
> commits it), and invokes his assumed ability to inject file B into
> Alice's wc.  This would work for file formats that ignore the unchosen
> suffix.

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Sun, Jan 29, 2023 at 16:36:12 +0300:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > > That could happen after a public disclosure of a pair of executable
> > > files/scripts where the forged version allows for remote code execution.
> > > Or maybe something similar with a file format that is often stored in
> > > repositories and that can be executed or used by a build script, etc.
> > >
> >
> > Err, hang on.  Your reference described a chosen-prefix attack, while
> > this scenario concerns a single public collision.  These are two
> > different things.
> 
> A chosen-prefix attack allows finding more meaningful collisions such as
> working executables/scripts.  When such collisions are made public, they
> would have a greater exploitation potential than just a random collision.
> 

Right.  So we're assuming Mallory generates a chosen-prefix collision,
and then somehow pulls off steps #1 and #2-as-amended [both quoted
below], with Alice noticing none of that.

That still sounds like something we should assume Mallory can pull off.

> > Disclosure of a pair of executable files/scripts isn't by itself
> > a problem unless one of the pair ("file A") is in a repository
> > somewhere.  Now, was the colliding file ("file B") generated _before_ or
> > _after_ file A was committed?
> >
> > - If _before_, then it would seem Mallory had somehow managed to:
> >
> >   1. get a file of his choosing committed to Alice's repository; and
> >
> >   2. get a wc of Alice's repository into one of the codepaths that
> >      assume SHA-1 is one-to-one / collision-free (currently that's the
> >      ra_serf optimization and the 1.15 wc status).
> 
> Not only.  There are cases when the working copy itself installs the working
> file with a hash lookup in the pristine store.  This is more true for 1.14
> than trunk, because in trunk we have the streamy checkout/update that avoids
> such lookups by writing straight to the working file.  However, some of
> the code paths still install the contents from the pristine store by hash.
> Examples include reverting a file, copying an unmodified file, switching
> a file with keywords, the mentioned ra_serf optimization, etc.
> 

Thanks.  In terms of that step #2, all these are also candidates for
"one of the codepaths", then.

> >   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
> >   me how file B would reach the wc in step #2…
> 
> If Mallory has write access, she could commit both files, thus arranging for
> a possible content change if both files are checked out to a single working
> copy.  This isn't the same as just directly modifying the target file, because
> file content isn't expected to change due to changes in other files (that can
> be of any type), so this attack has much better chances of being unnoticed.
> 

Well, yes, but the write access requirement lowers severity.

> If Mallory doesn't have write access, there should be other vectors, such
> as distributing a pair of files (harmless in the context of their respective
> file formats) separately via two upstream channels.  Then, if both of the
> upstream distributions are committed into a repository and their files are
> checked out together, the content will change, allowing for a malicious
> action.

I take it we're still under the assumption that someone's repository has
rep-sharing disabled (or unsupported, i.e., pre-1.6 format) despite the
recommendation in security/sha1-advisory.txt, since otherwise the commit
would be rejected.

So, back to my question which you have snipped:

> >   So, I agree it's a scenario we should address.  What options do we
> >   have to address it?  (I grant that migrating away from SHA-1 is one
> >   option.)

Care to address that?

Daniel

> 
> Regards,
> Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> Look, it's pretty simple.  You said "We should do Y because it
> addresses X".  You didn't explain why X needs to be addressed, didn't
> consider what alternatives there are to Y, didn't consider any cons that
> Y may have… and when people had questions, you just began to
> implement Y, without responding to or even acknowledging those
> questions.
>
> That's not how design discussions work.  A design discussion doesn't go
> "state decision; state pros; implement"; it goes "state problem; discuss
> potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

Well, I think it may not be as simple as it seems to you.  Who decided that
we should follow the process you're describing?  Is there a thread with a
consensus on this topic?  Or do you insist on using this specific process
because it's the only process that seems obvious to you?  What alternatives
to it have been considered?

As far as I can tell, the process you're suggesting is effectively a
waterfall-like process, and there are quite a lot of concerns about its
effectiveness, because the decisions have to be made in the conditions of
a lack of information.

Personally, I prefer an alternative process that starts from finding out
all available bits of information, which are then used to make informed
decisions.  The unfortunate reality, however, is that the only guaranteed
way of collecting all information means implementing all (or almost all)
significant parts in code.  Roughly speaking, this process looks like a
research project that gets completed by trial and error.

Based on what you've been saying so far, I wouldn't be surprised if you
disagree.  But I still think that forcing the others to follow a certain
process by such means as vetoing a code change is maybe a bit over the
top.  (In the meantime, I certainly won't object if you're going to use this
waterfall-like process for the changes that you implement yourself.)

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Sun, Jan 29, 2023 at 16:37:20 +0300:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > > (I'm not saying that the above rules have to be used in this particular case
> > >  and that a veto is invalid, but still thought it’s worth mentioning.)
> > >
> >
> > I vetoed the change because it hadn't been designed on the dev@ list,
> > had not garnered dev@'s consensus, and was being railroaded through.
> > (as far as I could tell)
> 
> I have *absolutely* no idea where "being railroaded through" comes from.
> Really, it's a wrong way of portraying and thinking about the events that have
> happened so far.
> 
> Reiterating over those events: I wrote an email containing my thoughts
> and explaining the motivation for such change.  I didn't reply to some of
> the questions (including some tricky questions, such as the one featuring
> a theoretical hash function), because they have been at least partly
> answered by others in the thread, and I didn't have anything valuable
> to add at that time.
> 
> During that time, I was actively coding the core part of the change,
> to check if it's possible technically.  Which is important, as far as
> I believe, because not all theoretically possible solutions can be implemented
> without facing significant practical or implementation-related issues, and
> it seems to me that you significantly undervalue such an approach.
> 

Quoting myself from elsethread: [3]

    - If the branch is seen and presented as a PoC for furthering discussion
      and for discovering practical considerations (e.g., that
      PRISTINE.MD5_CHECKSUM docstring I found yesterday during discussion,
      or the ra_serf sha1 optimization that anyone implementing the branch
      would run into), it's likely a good thing.

> I do not say my actions were exemplary, but as far as I can tell, they're
> pretty much in line with how svn-dev has been operating so far.  But, it all
> resulted in an unclear veto without any _technical_ arguments, where what's
> being vetoed is unclear as well, because the change was not ready at the
> moment the veto was cast.
> 

Look, it's pretty simple.  You said "We should do Y because it
addresses X".  You didn't explain why X needs to be addressed, didn't
consider what alternatives there are to Y, didn't consider any cons that
Y may have… and when people had questions, you just began to
implement Y, without responding to or even acknowledging those
questions.

That's not how design discussions work.  A design discussion doesn't go
"state decision; state pros; implement"; it goes "state problem; discuss
potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

That's why I called veto: not because I considered any particular
proposal then on the table unreasonable, but because I considered /the
decision process being used/ unreasonable (cf. [7]).

> And because your veto goes in favor of a specific process

Yes, I'm arguing in favour of first defining a problem, then considering
solutions to it, both their pros and cons, and only then deciding what
to implement.  This process isn't unique, novel, or singular; it's
standard in multiple disciplines [4–7].

>                                                           (considering that
> no other arguments were given), the only thing that's *actually* being
> railroaded is an odd form of an RTC (review-then-commit) process that is
> against our usual CTR (commit-then-review) [1,2].  That's railroading,
> because it hasn't been explicitly discussed anywhere and a consensus
> on it has not been reached.

This thread was started on 2022-12-20 [1], with the idiomatic
"Thoughts?" sign-off.  The first relevant code was committed on
2023-01-19 [2].

That is: the change followed RTC to begin with.  Considering that both
[1] and [2] were authored by you personally, I find it difficult to
charitably interpret your claim that "an odd form of [RTC]" was being
"railroaded", as RTC rather than "our usual CTR [process]" was being
followed at your own decision.

It's perhaps worth pointing out the veto followed the branch creation
because that was the point when I gave up on waiting for someone to
respond to the objections that had been made by then.  It wasn't a veto
on using a branch, as I have clarified: [3]

    I didn't object to the use of a branch /per se/.  I objected to the
    treating of objections that *had already been posted* as though they had
    never been posted.  *That's* not acceptable.

So, no, I wasn't advocating /either/ RTC or CTR; I was advocating that
the "R" step happen at all.  A branch may take place before, during, or
after discussion — see [3] for more — but the important thing is that
discussion happen.  The OP doesn't have to agree with all points made,
but doesn't get to ignore them and proceed as though they have never
been posted.

Daniel

[1] https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAP_GPNh2erpHzP0umxV_MuZRXKCkW_n8gJEGsM4aafqcKk02RQ%40mail.gmail.com%3E
[2] r1906817
[3] https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121092231.GA3174%40tarpaulin.shahaf.local2%3E
[4] https://skybrary.aero/articles/dec
[5] http://paulgraham.com/essay.html under the second and third headings
[6] https://xyproblem.info/
[7] the Business Judgment Rule

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Branko Čibej <br...@apache.org>.

On 18.01.2024 08:43, Daniel Sahlberg wrote:
> As far as I understand, the point of multi-hash is to keep the WC 
> format between versions (so older clients can continue to use the WC). 
> I need some help to understand how that would work in practice. Let's 
> say that 1.15 adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
> - A 1.17 client will only use SHAABC or SHAXYZ hashes.
> - A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
> - A 1.15 client can only use SHA1 and SHAABC hashes.
>
> How can these work together? A WC created in 1.17 can't be used by a 
> 1.15 client and a WC created in 1.15 (with SHA1) can't be used by a 
> 1.17 client. How is this different from bumping the format? How do we 
> detect this?

It's just another dimension of changing the format. When you introduce 
multihash, you have to bump the format number so that clients that don't 
know about it won't try to use the WC. Clients that _do_ know about it 
will have to check which hash algorithm(s) are used in any case.
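A minimal sketch of that two-step check. All names here (SUPPORTED_FORMATS, SUPPORTED_HASHES, check_wc) are made up for illustration and are not Subversion's API; the format numbers and hash names are placeholders too.

```python
# Hypothetical sketch: none of these names exist in Subversion.
SUPPORTED_FORMATS = {31, 32}            # WC formats this client understands
SUPPORTED_HASHES = {"sha1", "sha256"}   # hash kinds this client implements


class UnsupportedWorkingCopy(Exception):
    """Raised when this client must not touch the working copy."""


def check_wc(wc_format, wc_hash_kind):
    # Step 1: the format number keeps out clients that predate multihash.
    if wc_format not in SUPPORTED_FORMATS:
        raise UnsupportedWorkingCopy(f"unknown format {wc_format}")
    # Step 2: a client that knows the format still has to check the hash kind.
    if wc_hash_kind not in SUPPORTED_HASHES:
        raise UnsupportedWorkingCopy(f"unknown hash kind {wc_hash_kind!r}")
    return True
```

The point of the sketch is only that the hash kind becomes a second thing to check after the format number, not a replacement for it.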

> At least, we'd need some method of updating the hashes in the 
> database, akin to the WC format upgrades in some versions (was it 1.8?).

"svn upgrade" is where this would happen. On the multi-wc-format branch 
(if memory serves), it accepts a target WC version -- which is 
equivalent to the feature set supported by the WC. There's no reason why 
it couldn't also grow a "--force-hash=quantum-entangled" option.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Sat, Jan 13, 2024 at 3:56 PM Nathan Hartman <ha...@gmail.com>
wrote:

> Pros: Future-proofing against the real and perceived brokenness of any
> hash types.
>

I meant to write:

Pros: Future-proofing against the real and perceived brokenness of any hash
types, or the deprecation and later removal of their implementations from
our deps.

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Sahlberg <da...@gmail.com> writes:

> As far as I understand, the point of multi-hash is to keep the WC format
> between versions (so older clients can continue to use the WC).

Just as a minor note, the working copies created using the implementation
on the `pristine-checksum-salt` branch don't multi-hash the contents, but
rather make the [single] used checksum kind configurable and persist it at
the moment when a working copy is created or upgraded.

> I need some help to understand how that would work in practice. Let's say
> that 1.15 adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
> - A 1.17 client will only use SHAABC or SHAXYZ hashes.
> - A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
> - A 1.15 client can only use SHA1 and SHAABC hashes.
>
> How can these work together? A WC created in 1.17 can't be used by a 1.15
> client and a WC created in 1.15 (with SHA1) can't be used by a 1.17 client.
> How is this different from bumping the format? How do we detect this?

In the current design available on the `pristine-checksum-salt` branch, the
supported checksum kinds are tied to a working copy format, and any supported
checksum kind may additionally use a dynamic salt.  For example, format 33
supports only SHA-1 (regular or dynamically salted), but a newer format 34
can add support for another checksum kind such as SHA-2 if necessary.

When an existing working copy is upgraded to a newer format, its current
checksum kind is retained as is (we can't rehash the content in a
`--store-pristine=no` case because the pristines are not available).
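To make the shape of that design concrete, here is a rough sketch. It is not the code on the branch: the function names, the dictionary layout, and in particular the way the salt is mixed in (prepended to the content) are assumptions made only for illustration.

```python
import hashlib
import os

# Assumed mapping, following the example above: format 33 knows only SHA-1,
# a later format 34 could add SHA-2 (SHA-256 is picked here as a stand-in).
KINDS_BY_FORMAT = {33: {"sha1"}, 34: {"sha1", "sha256"}}


def new_wc(wc_format, kind, salted):
    """Create a working-copy record; the kind and salt are fixed at this point."""
    if kind not in KINDS_BY_FORMAT[wc_format]:
        raise ValueError(f"format {wc_format} does not support {kind}")
    salt = os.urandom(16) if salted else b""
    return {"format": wc_format, "kind": kind, "salt": salt}


def checksum(wc, content):
    # Assumption: the salt is simply prepended to the content before hashing.
    return hashlib.new(wc["kind"], wc["salt"] + content).hexdigest()


def upgrade(wc, new_format):
    """Bump the format but retain the checksum kind and salt as they are."""
    if wc["kind"] not in KINDS_BY_FORMAT[new_format]:
        raise ValueError(f"format {new_format} does not support {wc['kind']}")
    return {**wc, "format": new_format}
```

In this sketch an upgraded working copy keeps producing the same checksums as before, which is what makes the upgrade possible without access to the pristine contents.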

I don't know if we'll find ourselves having to forcefully phase out SHA-1
*even* for such working copies that retain an older checksum kind, i.e.,
it might be enough to use the new checksum kind only for freshly created
working copies.  However, there would be a few options to consider:

I think that milder options could include warning the user to check out a
new working copy (that would use a different checksum kind), and a harsher
option could mean adding a new format that doesn't support SHA-1 under
any circumstances, and declaring all previously available working copy
formats unsupported.

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

@Karl Fogel <kf...@red-bean.com>,  @Evgeny Kotkov
<ev...@visualsvn.com>

Any chance for a comment on the questions in this thread?

I've also added my own comment below.

Kind regards,
Daniel



On Sun, 14 Jan 2024 at 00:56, Nathan Hartman <hartman.nathan@gmail.com> wrote:

> On Fri, Jan 12, 2024 at 3:51 PM Johan Corveleyn <jc...@gmail.com> wrote:
>
>> On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name>
>> wrote:
>> ...
>> > Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
>> > I had the context in our heads, and the cache misses took their toll in
>> > tuits and in wallclock time.  Furthermore, I have less spare time for
>> > dev@ discussions than I did when I cast the veto (= a year ago next
>> > Saturday).  Going forward it might be preferable for threads not to
>> > hibernate.
>>
>> I agree, but obviously the hibernation is not some deliberate action
>> by anyone. It's just that most of us here have less spare time for
>> dev@ discussions (and for SVN development) than before. Especially for
>> such complex matters, and especially when people feel they are
>> walking into a minefield. There are only a few active devs left, and
>> tuits are running low ...
>>
>> ...
>> > That being the case, I have considered whether merging the feature
>> > branch outweighs letting dev@ take a not-only-/pro forma/ role in
>> > design discussions.  I am of the opinion that it does not, and
>> > therefore I reäffirm the veto.
>>
>> It has become more clear to me (I was only following tangentially)
>> that your veto is focused on the development methodology and the lack
>> of design discussion. Is that a valid reason for a veto? We are low on
>> resources, someone still finds time to make some progress, no one
>> blocks it on technical grounds, and then someone vetoes it because we
>> don't have enough resources?
>>
>> That puts us pretty much in deadlock, because we are too low on
>> resources. Or maybe I misunderstand?
>>
>> To be clear: I appreciate your input, Daniel, and your insistence on a
>> more thorough design discussion. I assume it's coming from a genuine
>> concern that we formulate problems well, and think hard about possible
>> solutions (focusing on the precise problem we are trying to solve).
>> But at the end of the day, if that design discussion doesn't happen
>> (or not enough to your satisfaction anyway), is that grounds for a
>> veto? For me it's a tough call, because on the one hand you have a
>> point, but on the other hand ... you're blocking _some_ progress
>> because the process behind it is not perfect (which is hard to do with
>> the 3.25 tuits we have left).
>>
>> > P.S.  Could that BRANCH-README please state what's the problem the branch
>> > means to solve, i.e., the goal / acceptance test?  "Make it possible to
>> > «svn add» SHA-1 collisions"?
>>
>> I agree that would be a good step.
>>
>> I too find it a bit unclear what problem we're actually trying to
>> solve, apart from a vague feeling that SHA-1 will become more and more
>> broken over time, and that this will cause fatal injury to SVN (in its
>> WC, protocol, dump format, or repository). And perhaps the fact that
>> security auditors are becoming more and more triggered by seeing SHA-1
>> (even if they don't understand the way it is used and its
>> ramifications). Making it possible to 'svn add' SHA-1 collisions is
>> not it, I think.
>>
>> --
>> Johan
>>
>
>
> Johan's reply sums up my thoughts pretty closely.
>
> I would very much like to *avoid* all of the following: deadlock, bad
> feelings, and members of this small community leaving because of deadlocks
> or bad feelings.
>
> I agree that (at the very least), BRANCH-README should define what problem
> the branch aims to solve, and perhaps that's really the main thing we need
> to discuss and resolve.
>
> Johan touched on one issue with SHA1: regardless of how it is actually used
> in SVN and whether it is adequate for those purposes, there is customer
> perception. I can imagine, for example, the IT dept of some big
> $corporation could blacklist SHA1 because it is considered broken for
> cryptographic purposes. But they could blacklist it for everything. Even
> though it is safe and effective for our use cases, try explaining that to
> an admin who is struggling to meet such a blanket policy.
>
> I would like to add another reason to think about a post-SHA1 future: I'm
> writing on mobile so I can't easily grep for things now, but could our
> dependencies eventually remove the SHA1 implementation? (I just saw
> something about removal of DSA from some famous lib not too long ago. SHA1
> could be next?)
>
> When would SHA1 disappear? I don't know, but I consider it plausible to
> happen in about 5 years.
>
> If SHA1 is removed in the future, there will need to be a mad dash to
> replace it. Or we'll have to add a new dependency to use an alternate
> implementation. Or we'll have to implement our own SHA1 or copy some code
> into SVN. All of these seem bad to me.
>
> Switching to a different hash is also a bad idea, I think, because it is
> likely to suffer the same problems as SHA1 later on, as cryptography
> research proceeds and newer hashes become declared broken.
>
> I'll try to describe what I think is a best case scenario: Support
> multi-hash in 1.15 in format 32 WCs. SHA1 can continue to be the default
> but we should be careful not to require a SHA1 implementation to exist.
> Furthermore, by default "svn checkout" continues to create format 31 WCs
> (this is implemented currently). When new (1.15 and up) servers talk to new
> clients, they'll have to negotiate the "best" common hash for the protocol.
> Over time, we can add other hashes. Over time, distros and package managers
> pick up 1.15. Someday down the line (5 years?), if SHA1 goes away, or an IT
> dept wants to avoid SHA1 for whatever reasons, most of the hard work of
> changing hashes will have been done already and most people will have the
> newer software on their system already. Changing hashes then becomes a
> trivial matter. The same will be true of any future hashes that become
> declared broken, requiring almost no additional work on our part. Notably,
> it will not be necessary to bump the WC or protocol formats because of
> hashes.
>

> Pros: Future-proofing against the real and perceived brokenness of any
> hash types.
>
> Cons: Requires a lot of work up front, which no one might volunteer to do.
>
> We should continue hashing out (pun intended) how to address the different
> concerns raised.
>
> Are there any technical reasons *not* to support other hashes going
> forward?
>
> Are there other pros or cons to supporting a scenario like I described?
>

As far as I understand, the point of multi-hash is to keep the WC format
between versions (so older clients can continue to use the WC). I need some
help to understand how that would work in practice. Let's say that 1.15
adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
- A 1.17 client will only use SHAABC or SHAXYZ hashes.
- A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
- A 1.15 client can only use SHA1 and SHAABC hashes.

How can these work together? A WC created in 1.17 can't be used by a 1.15
client and a WC created in 1.15 (with SHA1) can't be used by a 1.17 client.
How is this different from bumping the format? How do we detect this?
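The scenario above can be written down as a small table. This is only an illustration of the question being asked; SHAABC and SHAXYZ are the placeholder names from the text, and the function name is made up.

```python
# Hash kinds each hypothetical client release can handle, as listed above.
CLIENT_HASHES = {
    "1.15": {"SHA1", "SHAABC"},
    "1.16": {"SHA1", "SHAABC", "SHAXYZ"},
    "1.17": {"SHAABC", "SHAXYZ"},
}


def clients_able_to_use(wc_hash):
    """Return the client versions that could work with a WC using wc_hash."""
    return sorted(v for v, kinds in CLIENT_HASHES.items() if wc_hash in kinds)


for kind in ("SHA1", "SHAABC", "SHAXYZ"):
    print(kind, clients_able_to_use(kind))
```

Under these assumptions only a SHAABC working copy is usable by all three clients, so the hash kind behaves like one more compatibility dimension next to the format number.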

At least, we'd need some method of updating the hashes in the database,
akin the WC format upgrades in some versions (was it 1.8?).

Kind regards,
Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Fri, Jan 12, 2024 at 3:51 PM Johan Corveleyn <jc...@gmail.com> wrote:

> On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name>
> wrote:
> ...
> > Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
> > I had the context in our heads, and the cache misses took their toll in
> > tuits and in wallclock time.  Furthermore, I have less spare time for
> > dev@ discussions than I did when I cast the veto (= a year ago next
> > Saturday).  Going forward it might be preferable for threads not to
> > hibernate.
>
> I agree, but obviously the hibernation is not some deliberate action
> by anyone. It's just that most of us here have less spare time for
> dev@ discussions (and for SVN development) than before. Especially for
> such complex matters, and especially when people feel they are
> walking into a minefield. There are only a few active devs left, and
> tuits are running low ...
>
> ...
> > That being the case, I have considered whether merging the feature
> > branch outweighs letting dev@ take a not-only-/pro forma/ role in
> > design discussions.  I am of the opinion that it does not, and
> > therefore I reäffirm the veto.
>
> It has become more clear to me (I was only following tangentially)
> that your veto is focused on the development methodology and the lack
> of design discussion. Is that a valid reason for a veto? We are low on
> resources, someone still finds time to make some progress, no one
> blocks it on technical grounds, and then someone vetoes it because we
> don't have enough resources?
>
> That puts us pretty much in deadlock, because we are too low on
> resources. Or maybe I misunderstand?
>
> To be clear: I appreciate your input, Daniel, and your insistence on a
> more thorough design discussion. I assume it's coming from a genuine
> concern that we formulate problems well, and think hard about possible
> solutions (focusing on the precise problem we are trying to solve).
> But at the end of the day, if that design discussion doesn't happen
> (or not enough to your satisfaction anyway), is that grounds for a
> veto? For me it's a tough call, because on the one hand you have a
> point, but on the other hand ... you're blocking _some_ progress
> because the process behind it is not perfect (which is hard to do with
> the 3.25 tuits we have left).
>
> > P.S.  Could that BRANCH-README please state what's the problem the branch
> > means to solve, i.e., the goal / acceptance test?  "Make it possible to
> > «svn add» SHA-1 collisions"?
>
> I agree that would be a good step.
>
> I too find it a bit unclear what problem we're actually trying to
> solve, apart from a vague feeling that SHA-1 will become more and more
> broken over time, and that this will cause fatal injury to SVN (in its
> WC, protocol, dump format, or repository). And perhaps the fact that
> security auditors are becoming more and more triggered by seeing SHA-1
> (even if they don't understand the way it is used and its
> ramifications). Making it possible to 'svn add' SHA-1 collisions is
> not it, I think.
>
> --
> Johan
>

Johan's reply sums up my thoughts pretty closely.

I would very much like to *avoid* all of the following: deadlock, bad
feelings, and members of this small community leaving because of deadlocks
or bad feelings.

I agree that (at the very least), BRANCH-README should define what problem
the branch aims to solve, and perhaps that's really the main thing we need
to discuss and resolve.

Johan touched on one issue with SHA1: regardless of how it is actually used in
SVN and whether it is adequate for those purposes, there is customer
perception. I can imagine, for example, the IT dept of some big
$corporation could blacklist SHA1 because it is considered broken for
cryptographic purposes. But they could blacklist it for everything. Even
though it is safe and effective for our use cases, try explaining that to
an admin who is struggling to meet such a blanket policy.

I would like to add another reason to think about a post-SHA1 future: I'm
writing on mobile so I can't easily grep for things now, but could our
dependencies eventually remove the SHA1 implementation? (I just saw
something about removal of DSA from some famous lib not too long ago. SHA1
could be next?)

When would SHA1 disappear? I don't know, but I consider it plausible to
happen in about 5 years.

If SHA1 is removed in the future, there will need to be a mad dash to
replace it. Or we'll have to add a new dependency to use an alternate
implementation. Or we'll have to implement our own SHA1 or copy some code
into SVN. All of these seem bad to me.

Switching to a different hash is also a bad idea, I think, because it is
likely to suffer the same problems as SHA1 later on, as cryptography
research proceeds and newer hashes become declared broken.

I'll try to describe what I think is a best case scenario: Support
multi-hash in 1.15 in format 32 WCs. SHA1 can continue to be the default
but we should be careful not to require a SHA1 implementation to exist.
Furthermore, by default "svn checkout" continues to create format 31 WCs
(this is implemented currently). When new (1.15 and up) servers talk to new
clients, they'll have to negotiate the "best" common hash for the protocol.
Over time, we can add other hashes. Over time, distros and package managers
pick up 1.15. Someday down the line (5 years?), if SHA1 goes away, or an IT
dept wants to avoid SHA1 for whatever reasons, most of the hard work of
changing hashes will have been done already and most people will have the
newer software on their system already. Changing hashes then becomes a
trivial matter. The same will be true of any future hashes that become
declared broken, requiring almost no additional work on our part. Notably,
it will not be necessary to bump the WC or protocol formats because of
hashes.
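As a rough illustration of what negotiating the "best" common hash could look like: the sketch below assumes the server has a preference-ordered list and the client advertises a set of kinds it supports. Nothing like this exists in the protocol today; the function and the hash names are hypothetical.

```python
def negotiate(server_prefs, client_kinds):
    """Pick the first hash in the server's preference order that the client
    also supports; return None when there is no common hash."""
    for kind in server_prefs:
        if kind in client_kinds:
            return kind
    return None


# A server that prefers a newer hash but can still fall back to SHA1.
server_prefs = ["sha256", "sha1"]
print(negotiate(server_prefs, {"sha1"}))            # older client
print(negotiate(server_prefs, {"sha1", "sha256"}))  # newer client
```

In this sketch, dropping SHA1 later would mean removing it from the server's list, and a client without any common hash gets None and has to fail cleanly.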

Pros: Future-proofing against the real and perceived brokenness of any hash
types.

Cons: Requires a lot of work up front, which no one might volunteer to do.

We should continue hashing out (pun intended) how to address the different
concerns raised.

Are there any technical reasons *not* to support other hashes going forward?

Are there other pros or cons to supporting a scenario like I described?

Thanks,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

On Sat, 13 Jan 2024 at 00:50, Johan Corveleyn <jc...@gmail.com> wrote:

> On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name>
> wrote:
> ...
> > Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
> > I had the context in our heads, and the cache misses took their toll in
> > tuits and in wallclock time.  Furthermore, I have less spare time for
> > dev@ discussions than I did when I cast the veto (= a year ago next
> > Saturday).  Going forward it might be preferable for threads not to
> > hibernate.
>
> I agree, but obviously the hibernation is not some deliberate action
> by anyone. It's just that most of us here have less spare time for
> dev@ discussions (and for SVN development) than before. Especially for
> such complex matters, and especially when people feel they are
> walking into a minefield. There are only a few active devs left, and
> tuits are running low ...
>

I agree with Johan on this. The long hiatus is unfortunate. But it won't
help to point fingers at this point.



>
> ...
> > That being the case, I have considered whether merging the feature
> > branch outweighs letting dev@ take a not-only-/pro forma/ role in
> > design discussions.  I am of the opinion that it does not, and
> > therefore I reäffirm the veto.
>
> It has become more clear to me (I was only following tangentially)
> that your veto is focused on the development methodology and the lack
> of design discussion. Is that a valid reason for a veto? We are low on
> resources, someone still finds time to make some progress, no one
> blocks it on technical grounds, and then someone vetoes it because we
> don't have enough resources?
>
> That puts us pretty much in deadlock, because we are too low on
> resources. Or maybe I misunderstand?
>
> To be clear: I appreciate your input, Daniel, and your insistence on a
> more thorough design discussion. I assume it's coming from a genuine
> concern that we formulate problems well, and think hard about possible
> solutions (focusing on the precise problem we are trying to solve).
> But at the end of the day, if that design discussion doesn't happen
> (or not enough to your satisfaction anyway), is that grounds for a
> veto? For me it's a tough call, because on the one hand you have a
> point, but on the other hand ... you're blocking _some_ progress
> because the process behind it is not perfect (which is hard to do with
> the 3.25 tuits we have left).
>
> > P.S.  Could that BRANCH-README please state what's the problem the branch
> > means to solve, i.e., the goal / acceptance test?  "Make it possible to
> > «svn add» SHA-1 collisions"?
>
> I agree that would be a good step.
>
> I too find it a bit unclear what problem we're actually trying to
> solve, apart from a vague feeling that SHA-1 will become more and more
> broken over time, and that this will cause fatal injury to SVN (in its
> WC, protocol, dump format, or repository). And perhaps the fact that
> security auditors are becoming more and more triggered by seeing SHA-1
> (even if they don't understand the way it is used and its
> ramifications). Making it possible to 'svn add' SHA-1 collisions is
> not it, I think.
>

I also agree with this.

From what I remember of the discussions earlier there were concerns that a
changed file might go undetected if someone changes it to another file with
a collision with the original file. I think that might be a valid point,
especially if we don't have the pristine files anymore.
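A toy illustration of that concern, assuming a status check that has no pristine text and can only compare checksums. The checksum here is deliberately weak (sum of bytes modulo 256) so that a collision is easy to produce; it stands in for any hash with a practical collision attack and is not how Subversion computes anything.

```python
def toy_checksum(data):
    # Deliberately weak stand-in for a hash with known collisions.
    return sum(data) % 256


def looks_modified(recorded_checksum, working_file):
    """Checksum-only comparison, as when no pristine copy is stored."""
    return toy_checksum(working_file) != recorded_checksum


original = b"ab"
recorded = toy_checksum(original)

print(looks_modified(recorded, b"ac"))  # ordinary edit: detected
print(looks_modified(recorded, b"ba"))  # colliding content: not detected
```

With a stored pristine the contents could still be compared byte for byte; without it, the recorded checksum is the only reference.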

I'd also like to understand why we need the multi-checksum format instead
of just plainly switching to XXX (insert favourite checksumming algorithm
here). Does it help us to have multiple types of checksums available? Would
we use BOTH as a resort (likelihood of collision in SHA1 and in XXX at the
same time approaching zero)? Does it help backwards/forwards compatibility?

Kind regards,
Daniel Sahlberg

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Johan Corveleyn <jc...@gmail.com>.

On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
...
> Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
> I had the context in our heads, and the cache misses took their toll in
> tuits and in wallclock time.  Furthermore, I have less spare time for
> dev@ discussions than I did when I cast the veto (= a year ago next
> Saturday).  Going forward it might be preferable for threads not to
> hibernate.

I agree, but obviously the hibernation is not some deliberate action
by anyone. It's just that most of us here have less spare time for
dev@ discussions (and for SVN development) than before. Especially for
such complex matters, and especially when people feel they are
walking into a minefield. There are only a few active devs left, and
tuits are running low ...

...
> That being the case, I have considered whether merging the feature
> branch outweighs letting dev@ take a not-only-/pro forma/ role in
> design discussions.  I am of the opinion that it does not, and
> therefore I reäffirm the veto.

It has become more clear to me (I was only following tangentially)
that your veto is focused on the development methodology and the lack
of design discussion. Is that a valid reason for a veto? We are low on
resources, someone still finds time to make some progress, no one
blocks it on technical grounds, and then someone vetoes it because we
don't have enough resources?

That puts us pretty much in deadlock, because we are too low on
resources. Or maybe I misunderstand?

To be clear: I appreciate your input, Daniel, and your insistence on a
more thorough design discussion. I assume it's coming from a genuine
concern that we formulate problems well, and think hard about possible
solutions (focusing on the precise problem we are trying to solve).
But at the end of the day, if that design discussion doesn't happen
(or not enough to your satisfaction anyway), is that grounds for a
veto? For me it's a tough call, because on the one hand you have a
point, but on the other hand ... you're blocking _some_ progress
because the process behind it is not perfect (which is hard to do with
the 3.25 tuits we have left).

> P.S.  Could that BRANCH-README please state what's the problem the branch
> means to solve, i.e., the goal / acceptance test?  "Make it possible to
> «svn add» SHA-1 collisions"?

I agree that would be a good step.

I too find it a bit unclear what problem we're actually trying to
solve, apart from a vague feeling that SHA-1 will become more and more
broken over time, and that this will cause fatal injury to SVN (in its
WC, protocol, dump format, or repository). And perhaps the fact that
security auditors are becoming more and more triggered by seeing SHA-1
(even if they don't understand the way it is used and its
ramifications). Making it possible to 'svn add' SHA-1 collisions is
not it, I think.

-- 
Johan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Thu, Feb 1, 2024 at 5:26 PM Daniel Sahlberg
<da...@gmail.com> wrote:
>
> Gentlemen,
>
> It seems you have both had your say in what flaws there have been in the process. Can we please leave this part of the discussion and continue on the technical issues? I'd hate for this discussion to turn to pie-throwing where someone in the end feels offended and leaves the community. We are such a small community and we can't afford to lose someone just because an argument turns toxic (it has happened before so let's make sure it doesn't happen again, please).

I completely agree. Yes, there has been disagreement about process,
but it is counterproductive to debate that anymore. Let's focus on the
technical question and try to reach some consensus on what (if
anything) to do.

> As for the technical side, can we break down the current status and the desired future status to some points and then look at what options we have for solutions?
>
> Currently we use SHA1, which has known attacks. What are the risks?
> - It has been argued that `svn st` will, especially with no-pristines, be extra vulnerable to not detecting a modified file if someone can create a collision with the checksum of the original file
> - Someone also argued that a piece of software could potentially be banned just because it uses a checksum with a known attack, even if the checksum isn't used in a security critical way.

I was the one who spoke about that possibility.

Just one example: NIST has already recommended that federal agencies
stop using SHA-1 for "signatures and other operations threatened by
collision attacks" and by 31 Dec 2030 NIST will publish "a revision of
FIPS 180 that removes the SHA-1 specification" and "Modules that still
use SHA-1 after 2030 will not be permitted for purchase by the federal
government." All those quotes are taken from [1], which was one of the
top hits in a recent DuckDuckGo search. (I don't remember the exact
search.)

Now, even if the developers agree that SVN's uses of SHA1 are
completely safe, I think it is a real possibility that some sites
could ban SVN because they consider SHA1 a banned algorithm. Our
explanations that SVN's use of SHA1 is safe might not be accepted in
those settings, even if we are right.

Given the way technology is used, understood, and sometimes (often?)
misunderstood, I can imagine a ridiculous scenario in which Subversion
could use 8-bit CRC, but not SHA1, even though SHA1 is much stronger
than 8-bit CRC, just because SHA1 is "banned" and 8-bit CRC is not.
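
To make the comparison concrete, here is a minimal sketch of why an
8-bit CRC is so much weaker than SHA1: with only 256 possible values,
any 257 distinct inputs must contain a collision. The CRC-8 polynomial
(0x07) and the "file-N" inputs are assumptions chosen for illustration;
nothing here is taken from Subversion's code.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (polynomial x^8 + x^2 + x + 1, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def find_collision():
    """Return two distinct inputs with the same CRC-8, by brute force."""
    seen = {}
    # 257 inputs into 256 possible checksums: a repeat is guaranteed.
    for i in range(257):
        msg = b"file-%d" % i
        digest = crc8(msg)
        if digest in seen:
            return seen[digest], msg
        seen[digest] = msg
    raise AssertionError("unreachable: pigeonhole principle")


a, b = find_collision()
print(a, b, hex(crc8(a)))
```

The same brute-force search against a 160-bit hash is not feasible,
which is why the known SHA1 collisions required a dedicated attack.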

> What options do we have and how do they mitigate the above risks?
> - Evgeny has already shown a possible solution with a salted hash (keeping SHA-1).
> - Can we switch to another hash function completely and does it offer any benefits compared to the salted SHA-1?
> - Should we even do both?
>
> Any other points?
>
> Any thoughts?
>
> I would like to see this thread progress and I hope we can find consensus on a way forward.
>
> Kind regards,
> Daniel Sahlberg

I, too, hope the community can come together and reach a consensus,
whatever that ends up being.

[1] https://www.securityweek.com/nist-retire-27-year-old-sha-1-cryptographic-algorithm/

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

Gentlemen,

It seems you have both had your say in what flaws there have been in the
process. Can we please leave this part of the discussion and continue on
the technical issues? I'd hate for this discussion to turn to pie-throwing
where someone in the end feels offended and leaves the community. We are such
a small community and we can't afford to lose someone just because an
argument turns toxic (it has happened before so let's make sure it doesn't
happen again, please).

As for the technical side, can we break down the current status and the
desired future status to some points and then look at what options we have
for solutions?

Currently we use SHA1, which has known attacks. What are the risks?
- It has been argued that `svn st` will, especially with no-pristines, be
extra vulnerable to not detecting a modified file if someone can create a
collision with the checksum of the original file
- Someone also argued that a piece of software could potentially be banned just
because it uses a checksum with a known attack, even if the checksum isn't
used in a security critical way.
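
A minimal sketch of the first risk, assuming a simplified model in
which a working copy without pristines decides "modified or not" purely
by comparing the recorded checksum with the checksum of the file on
disk. The function names are hypothetical and this is not the actual
libsvn_wc logic, which may also consult size and timestamps.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Plain, unsalted SHA-1 of the file contents."""
    return hashlib.sha1(data).hexdigest()


def looks_modified(recorded_checksum: str, working_content: bytes) -> bool:
    """Hypothetical checksum-only check: there is no pristine text to diff
    against, so equal checksums are taken to mean 'unmodified'."""
    return sha1_hex(working_content) != recorded_checksum


recorded = sha1_hex(b"original contents\n")
print(looks_modified(recorded, b"original contents\n"))
print(looks_modified(recorded, b"edited contents\n"))
```

In this model, any replacement content whose SHA-1 collides with the
recorded checksum would be reported as unmodified, which is the
scenario the first bullet describes.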

What options do we have and how do they mitigate the above risks?
- Evgeny has already shown a possible solution with a salted hash (keeping
SHA-1).
- Can we switch to another hash function completely and does it offer any
benefits compared to the salted SHA-1?
- Should we even do both?

Any other points?

Any thoughts?

I would like to see this thread progress and I hope we can find consensus
on a way forward.

Kind regards,
Daniel Sahlberg


On Thu, 18 Jan 2024 at 14:36, Evgeny Kotkov via dev <
dev@subversion.apache.org> wrote:

> Daniel Shahaf <d....@daniel.shahaf.name> writes:
>
> > Procedurally, the long hiatus is counterproductive.
>
> This reminds me that the substantive discussion of your veto ended with my
> email from 8 Feb 2023 that had four direct questions to you and was left
> without an answer:
>
> ``````
>   > That's not how design discussions work.  A design discussion doesn't go
>   > "state decision; state pros; implement"; it goes "state problem; discuss
>   > potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).
>
>   Well, I think it may not be as simple as it seems to you.  Who decided that
>   we should follow the process you're describing?  Is there a thread with a
>   consensus on this topic?  Or do you insist on using this specific process
>   because it's the only process that seems obvious to you?  What alternatives
>   to it have been considered?
>
>   As far as I can tell, the process you're suggesting is effectively a
>   waterfall-like process, and there are quite a lot of concerns about its
>   effectiveness, because the decisions have to be made in the conditions of
>   a lack of information.
> ``````
>
> It's been more than 11 months since that email, and those questions still
> don't have an answer.  So if we are to resume this discussion, let's do it
> from the proper point.
>
> > You guys are welcome to try to /convince/ me to change my opinion, or to
> > have the veto invalidated.  In either case, you will be more likely to
> > succeed should your arguments relate not only to the veto's implications
> > but also to its /sine qua non/ component: its rationale.
>
> Just in case, my personal opinion here is that the veto is invalid.
>
> Firstly, based on my understanding, the ASF rules prohibit casting a veto
> without an appropriate technical justification (see [1], which I personally
> agree with).  Secondly, it seems that the process you are imposing hasn't been
> accepted in this community.  As far as I know, this topic was tangentially
> discussed before (see [2], for example), and it looks like there hasn't been
> a consensus to change our current Commit-Then-Review process into some
> sort of Review-Then-Commit.
>
> (At the same time I won't even try to /convince/ you, sorry.)
>
> [1] https://www.apache.org/foundation/voting.html
> [2] https://lists.apache.org/thread/ow2x68g2k4lv2ycr81d14p8r8w2jj1xl
>
>
> Regards,
> Evgeny Kotkov
>

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> Procedurally, the long hiatus is counterproductive.

This reminds me that the substantive discussion of your veto ended with my
email from 8 Feb 2023 that had four direct questions to you and was left
without an answer:

``````
  > That's not how design discussions work.  A design discussion doesn't go
  > "state decision; state pros; implement"; it goes "state problem; discuss
  > potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

  Well, I think it may not be as simple as it seems to you.  Who decided that
  we should follow the process you're describing?  Is there a thread with a
  consensus on this topic?  Or do you insist on using this specific process
  because it's the only process that seems obvious to you?  What alternatives
  to it have been considered?

  As far as I can tell, the process you're suggesting is effectively a
  waterfall-like process, and there are quite a lot of concerns about its
  effectiveness, because the decisions have to be made in the conditions of
  a lack of information.
``````

It's been more than 11 months since that email, and those questions still
don't have an answer.  So if we are to resume this discussion, let's do it
from the proper point.

> You guys are welcome to try to /convince/ me to change my opinion, or to
> have the veto invalidated.  In either case, you will be more likely to
> succeed should your arguments relate not only to the veto's implications
> but also to its /sine qua non/ component: its rationale.

Just in case, my personal opinion here is that the veto is invalid.

Firstly, based on my understanding, the ASF rules prohibit casting a veto
without an appropriate technical justification (see [1], which I personally
agree with).  Secondly, it seems that the process you are imposing hasn't been
accepted in this community.  As far as I know, this topic was tangentially
discussed before (see [2], for example), and it looks like there hasn't been
a consensus to change our current Commit-Then-Review process into some
sort of Review-Then-Commit.

(At the same time I won't even try to /convince/ you, sorry.)

[1] https://www.apache.org/foundation/voting.html
[2] https://lists.apache.org/thread/ow2x68g2k4lv2ycr81d14p8r8w2jj1xl

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Wed, 03 Jan 2024 22:13 +00:00:
> On 01 Apr 2023, Evgeny Kotkov via dev wrote:
> > Daniel Shahaf <d....@daniel.shahaf.name> writes:
> > 
> > > What's the question or action item to/for me?  Thanks.
> > 
> > I'm afraid I don't fully understand your question.  As you
> > probably remember, the change is blocked by your veto.  To my
> > knowledge, this veto hasn't been revoked as of now, and I simply
> > mentioned that in my email.  It is entirely your decision
> > whether or not to take any action regarding this matter.
> 
> So AIUI, Evgeny is asking you to withdraw your veto, Daniel. Evgeny would
> like to merge this into trunk -- on the grounds, I believe, that it is
> strictly an improvement over what we have now, and it opens the door to
> further future improvements (each of which would go through the usual
> discussion & consensus process, of course).

So, I looked.

This thread comprises 237 posts spanning 30 months (July 2021 through
today).  On 2023-01-20 I cast a veto.  There was some activity
afterwards, but until the parent post of this one, the thread has been
silent for the better part of a year; and now I'm being asked to
withdraw my veto.

Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
I had the context in our heads, and the cache misses took their toll in
tuits and in wallclock time.  Furthermore, I have less spare time for
dev@ discussions than I did when I cast the veto (= a year ago next
Saturday).  Going forward it might be preferable for threads not to
hibernate.

You didn't link the veto, so I had to go grep for it.  It is,
presumably, this one:

>>>> # Archived-At: https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C904aded6-5ef0-4123-ade0-e23a3bb56726%40app.fastmail.com%3E
>>>> Date: Fri, 20 Jan 2023 12:15:24 +0000
>>>> From: Daniel Shahaf
>>>> To: dev@subversion.apache.org
>>>> Subject: Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format
>>>> Message-Id: <90...@app.fastmail.com>
>>>> 
>>>> Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>>>> > I can complete the work on this branch and bring it to a production-ready
>>>> > state, assuming there are no objections.
>>>> 
>>>> Your assumption is counterfactual:
>>>> 
>>>> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>>>> 
>>>> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>>>> 
>>>> Objections have been raised, been left unanswered, and now
>>>> implementation work has commenced following the original design.  That's
>>>> not acceptable.  I'm vetoing the change until a non-rubber-stamp design
>>>> discussion has been completed on the public dev@ list.

So, this veto being in front of me, let me reply to the request that
I withdraw it:

> So AIUI, Evgeny is asking you to withdraw your veto, Daniel. Evgeny would
> like to merge this into trunk -- on the grounds, I believe, that it is
> strictly an improvement over what we have now, and it opens the door to
> further future improvements (each of which would go through the usual
> discussion & consensus process, of course).
> 
> Evgeny's work is on this branch...
> 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
> 
> ...which in turn branched from
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
> 
> I used this command to get an overview of the work:
> 
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README

As far as I can tell, the request for veto withdrawal is grounded only
in the fact that the veto, whilst in force, prevents the feature branch
from being merged/released.  The request does not allege the veto was
invalid or unfounded in the first place; nor that the veto has /become/
invalid or unfounded due to time having passed; nor that modifications
or alterations to the code [or, in this case, to the decision-making
process] have been made and are believed to have addressed the veto's
grounds.

In summary, the request only deals with the fact of a veto and its
formal/procedural implications, but does not deal with the substantive
justification for the veto at all.

That being the case, I have no reason to believe the original grounds of
the veto have been addressed.

That being the case, I have considered whether merging the feature
branch outweighs letting dev@ take a not-only-/pro forma/ role in
design discussions.  I am of the opinion that it does not, and
therefore I reäffirm the veto.

You guys are welcome to try to /convince/ me to change my opinion, or to
have the veto invalidated.  In either case, you will be more likely to
succeed should your arguments relate not only to the veto's implications
but also to its /sine qua non/ component: its rationale.

Before I salutate this post, I wish to point out that it's rather
ironic — or perhaps I should say /alarming/ — that the request for veto
withdrawal does not deal with the substantive grounds for the veto,
considering those grounds were "dev@ isn't being listened to".  In fact,
this is so inconsistent with the past 15+ years of kfogel interactions
that I feel I should ask whoever happens to live closest to kfogel's if
they would be so very kind as to pop over there, knock on the front
door, and tell him his email is being impersonated.  (Naturally, make
sure it's actually him at the door, first. :P)

Cheers,

Daniel

P.S.  Could that BRANCH-README please state what problem the branch
means to solve, i.e., the goal / acceptance test?  "Make it possible to
«svn add» SHA-1 collisions"?

> Evgeny's work is on this branch...
> 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
> 
> ...which in turn branched from
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
> 
> I used this command to get an overview of the work:
> 
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README
> 
> (The work is several months old now, but for the sake of discussion let's
> assume it's mergeable, passes all tests, etc. Obviously, Evgeny's only going
> to merge it when all of those conditions are true -- maybe some minor tweaks
> will be needed to get it there, I don't know.)

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 04 Jan 2024, Daniel Shahaf wrote:
>Acknowledging receipt.  I'll reply substantively when I have the 
>time to swap in the context.

Thanks.  Yeah, I went through the same context-swapping-in process 
yesterday before posting!

Best regards,
-Karl

>> Evgeny's work is on this branch...
>>
>> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
>>
>> ...which in turn branched from 
>> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
>>
>> I used this command to get an overview of the work:
>>
>> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README
>>
>> (The work is several months old now, but for the sake of 
>> discussion let's assume it's mergeable, passes all tests, etc. 
>> Obviously, Evgeny's only going to merge it when all of those 
>> conditions are true -- maybe some minor tweaks will be needed 
>> to 
>> get it there, I don't know.)
>>
>> Best regards,
>> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Wed, 03 Jan 2024 22:13 +00:00:
> On 01 Apr 2023, Evgeny Kotkov via dev wrote:
>>Daniel Shahaf <d....@daniel.shahaf.name> writes:
>>
>>> What's the question or action item to/for me?  Thanks.
>>
>>I'm afraid I don't fully understand your question.  As you
>>probably remember, the change is blocked by your veto.  To my
>>knowledge, this veto hasn't been revoked as of now, and I simply
>>mentioned that in my email.  It is entirely your decision
>>whether or not to take any action regarding this matter.
>
> So AIUI, Evgeny is asking you to withdraw your veto, Daniel. 
> Evgeny would like to merge this into trunk -- on the grounds, I 
> believe, that it is strictly an improvement over what we have now, 
> and it opens the door to further future improvements (each of 
> which would go through the usual discussion & consensus process, 
> of course).
>

Acknowledging receipt.  I'll reply substantively when I have the time to swap in the context.

Daniel

> Evgeny's work is on this branch...
>
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
>
> ...which in turn branched from 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
>
> I used this command to get an overview of the work:
>
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README
>
> (The work is several months old now, but for the sake of 
> discussion let's assume it's mergeable, passes all tests, etc. 
> Obviously, Evgeny's only going to merge it when all of those 
> conditions are true -- maybe some minor tweaks will be needed to 
> get it there, I don't know.)
>
> Best regards,
> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 01 Apr 2023, Evgeny Kotkov via dev wrote:
>Daniel Shahaf <d....@daniel.shahaf.name> writes:
>
>> What's the question or action item to/for me?  Thanks.
>
>I'm afraid I don't fully understand your question.  As you
>probably remember, the change is blocked by your veto.  To my
>knowledge, this veto hasn't been revoked as of now, and I simply
>mentioned that in my email.  It is entirely your decision
>whether or not to take any action regarding this matter.

So AIUI, Evgeny is asking you to withdraw your veto, Daniel. 
Evgeny would like to merge this into trunk -- on the grounds, I 
believe, that it is strictly an improvement over what we have now, 
and it opens the door to further future improvements (each of 
which would go through the usual discussion & consensus process, 
of course).

Evgeny's work is on this branch...

https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt

...which in turn branched from 
https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.

I used this command to get an overview of the work:

$ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README

(The work is several months old now, but for the sake of 
discussion let's assume it's mergeable, passes all tests, etc. 
Obviously, Evgeny's only going to merge it when all of those 
conditions are true -- maybe some minor tweaks will be needed to 
get it there, I don't know.)

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> What's the question or action item to/for me?  Thanks.

I'm afraid I don't fully understand your question.  As you probably remember,
the change is blocked by your veto.  To my knowledge, this veto hasn't been
revoked as of now, and I simply mentioned that in my email.  It is entirely
your decision whether or not to take any action regarding this matter.


Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Wed, 22 Mar 2023 15:23 +00:00:
> This change is still being blocked by a veto, but if danielsh changes his
> mind and if there are no other objections, I'm ready to complete the few
> remaining bits and merge it to trunk.

What's the question or action item to/for me?  Thanks.

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> > Now, how hard would this be to actually implement?
>
> To have a more or less accurate estimate, I went ahead and prepared the
> first-cut implementation of an approach that makes the pristine checksum
> kind configurable in a working copy.
>
> The current implementation passes all tests in my environment and seems to
> work in practice.  It is available on the branch:
>
>   https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind
>
> The implementation on the branch allows creating working copies that use a
> checksum kind other than SHA-1.

I extended the current implementation to use a dynamically salted SHA-1
checksum, rather than a SHA-1 with a statically hardcoded salt.
The dynamic salt is generated during the creation of a wc.db.
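
A minimal sketch of the general idea, not the code on the branch: a
random salt is generated once per working copy and mixed into every
pristine checksum. The salt length (16 bytes), the way the salt is
combined with the content (simple prefixing), and the function names
are all assumptions made for illustration.

```python
import hashlib
import os


def new_working_copy_salt(nbytes: int = 16) -> bytes:
    """Generate a random salt once, e.g. when the working copy is created."""
    return os.urandom(nbytes)


def salted_sha1(salt: bytes, content: bytes) -> str:
    """SHA-1 over salt followed by content (assumed construction)."""
    h = hashlib.sha1()
    h.update(salt)
    h.update(content)
    return h.hexdigest()


salt_a = new_working_copy_salt()
salt_b = new_working_copy_salt()
content = b"same file contents\n"

# The same content gets a different checksum in each working copy.
print(salted_sha1(salt_a, content))
print(salted_sha1(salt_b, content))
```

Under these assumptions, a pair of files prepared in advance to collide
under plain SHA-1 would not be expected to collide under a salt the
attacker does not know, because the hash state differs before the
content is processed.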

The implementation is available on a separate branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt

The change is fairly large, but I think that it should solve the
potential problem without any practical drawbacks, except for the lack
of the mentioned ra_serf fetch optimization.

So overall I'd propose to bring this change to trunk, to improve the current
state around checksum collisions in the working copy, and to also have the
infrastructure for supporting different checksum kinds in place, in case
we need it in the future.

This change is still being blocked by a veto, but if danielsh changes his
mind and if there are no other objections, I'm ready to complete the few
remaining bits and merge it to trunk.

Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

To be clear, I wasn't vetoing changing the hash algorithm.  I was
vetoing making a change without discussion.  If there is discussion and
it results in consensus to change the algorithm, that'll be absolutely
fine by me.

Daniel

Karl Fogel wrote on Sat, 21 Jan 2023 17:58 +00:00:
> *nod* This issue isn't important enough to me to continue the 
> conversation -- I'd like for new hash algorithms to be possible, 
> and I think Evgeny's work on it is worthwhile, but I don't feel 
> nearly as strongly about this as I feel about making the new 
> pristineless working copies available in an official release as 
> soon as we can.
>
> Best regards,
> -Karl
>
> On 21 Jan 2023, Daniel Shahaf wrote:
>>Karl Fogel wrote on Fri, Jan 20, 2023 at 11:09:11 -0600:
>>> On 20 Jan 2023, Daniel Shahaf wrote:
>>> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>>> > > I can complete the work on this branch and bring it to a
>>> > > production-ready
>>> > > state, assuming there are no objections.
>>> > 
>>> > Your assumption is counterfactual:
>>> > 
>>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>>> > 
>>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>>> > 
>>> > Objections have been raised, been left unanswered, and now
>>> > implementation work has commenced following the original 
>>> > design. That's
>>> > not acceptable.
>>> 
>>> I'm a little surprised by your reaction.
>>> 
>>> It is never "not acceptable" for someone to do implementation 
>>> work on a
>>> branch while a discussion is happening, even if that discussion 
>>> contains
>>> objections to or questions about the premise of the branch 
>>> work.
>>> 
>>> It's a branch.  He didn't merge it to trunk, and he posted it 
>>> as an explicit
>>> invitation for discussion.
>>> 
>>
>>I didn't object to the use of a branch /per se/.  I objected to 
>>the
>>treating of objections that *had already been posted* as though 
>>they had
>>never been posted.  *That's* not acceptable.
>>
>>However, since you ask, I don't think implementing a proposal on
>>a branch is necessarily a good idea:
>>
>>- If the branch is seen and presented as a PoC for furthering 
>>discussion
>>  and for discovering practical considerations (e.g., that
>>  PRISTINE.MD5_CHECKSUM docstring I found yesterday during 
>>  discussion,
>>  or the ra_serf sha1 optimization that anyone implementing the 
>>  branch
>>  would run into), it's likely a good thing.
>>  
>>- On the other hand, when the branch implements the original 
>>proposal,
>>  whilst outstanding questions were not only not answered but 
>>  also not
>>  acknowledged, that's quite another thing.  It can result in:
>>
>>  + The branch maintainer being biased in favour of the approach 
>>  they
>>    have implemented.  (People tend not to argue against what 
>>    they have
>>    expended resources on.  Cf. plan continuation bias, sunk cost
>>    fallacy.)
>>
>>  + dev@ being biased towards the approach that has been 
>>  implemented
>>    (because it's a known entity; because no one is volunteering 
>>    to
>>    implement another approach; because there's a desire to cut
>>    a minor release soon…).  This, in turn, can result in…
>>  
>>  + …an incentive for participants *not* to hold open design
>>    discussions on dev@ in the first place.
>>
>>> > I'm vetoing the change until a non-rubber-stamp design
>>> > discussion has been completed on the public dev@ list.
>>> 
>>> Starting an implementation on a branch is a valuable 
>>> contribution to a
>>> design discussion -- it's exactly the kind of 
>>> "non-rubber-stamp"
>>> contribution one would want.
>>> 
>>
>>You're just repeating what you said above.
>>
>>> If you want to re-iterate points you've made that have been 
>>> left unanswered,
>>> that would be a useful contribution -- perhaps some of those 
>>> points will be
>>> updated now that there's actual code, or perhaps they won't. 
>>> Either way,
>>> what Evgeny is doing here seems very constructive to me, and 
>>> entirely within
>>> the normal range of how we do things.
>>
>>Posting a paragraph such as the one I'm replying to is not 
>>"entirely
>>within the normal range of how we do things".  As to my points, 
>>see
>><https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E>.
>>They boil down to this:
>>
>>    <alice> We should migrate away from SHA-1.
>>    <bob> Why?
>>
>>Daniel
>>
>>> Best regards,
>>> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

*nod* This issue isn't important enough to me to continue the 
conversation -- I'd like for new hash algorithms to be possible, 
and I think Evgeny's work on it is worthwhile, but I don't feel 
nearly as strongly about this as I feel about making the new 
pristineless working copies available in an official release as 
soon as we can.

Best regards,
-Karl

On 21 Jan 2023, Daniel Shahaf wrote:
>Karl Fogel wrote on Fri, Jan 20, 2023 at 11:09:11 -0600:
>> On 20 Jan 2023, Daniel Shahaf wrote:
>> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>> > > I can complete the work on this branch and bring it to a
>> > > production-ready
>> > > state, assuming there are no objections.
>> > 
>> > Your assumption is counterfactual:
>> > 
>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>> > 
>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>> > 
>> > Objections have been raised, been left unanswered, and now
>> > implementation work has commenced following the original 
>> > design. That's
>> > not acceptable.
>> 
>> I'm a little surprised by your reaction.
>> 
>> It is never "not acceptable" for someone to do implementation 
>> work on a
>> branch while a discussion is happening, even if that discussion 
>> contains
>> objections to or questions about the premise of the branch 
>> work.
>> 
>> It's a branch.  He didn't merge it to trunk, and he posted it 
>> as an explicit
>> invitation for discussion.
>> 
>
>I didn't object to the use of a branch /per se/.  I objected to 
>the
>treating of objections that *had already been posted* as though 
>they had
>never been posted.  *That's* not acceptable.
>
>However, since you ask, I don't think implementing a proposal on
>a branch is necessarily a good idea:
>
>- If the branch is seen and presented as a PoC for furthering 
>discussion
>  and for discovering practical considerations (e.g., that
>  PRISTINE.MD5_CHECKSUM docstring I found yesterday during 
>  discussion,
>  or the ra_serf sha1 optimization that anyone implementing the 
>  branch
>  would run into), it's likely a good thing.
>  
>- On the other hand, when the branch implements the original 
>proposal,
>  whilst outstanding questions were not only not answered but 
>  also not
>  acknowledged, that's quite another thing.  It can result in:
>
>  + The branch maintainer being biased in favour of the approach 
>  they
>    have implemented.  (People tend not to argue against what 
>    they have
>    expended resources on.  Cf. plan continuation bias, sunk cost
>    fallacy.)
>
>  + dev@ being biased towards the approach that has been 
>  implemented
>    (because it's a known entity; because no one is volunteering 
>    to
>    implement another approach; because there's a desire to cut
>    a minor release soon…).  This, in turn, can result in…
>  
>  + …an incentive for participants *not* to hold open design
>    discussions on dev@ in the first place.
>
>> > I'm vetoing the change until a non-rubber-stamp design
>> > discussion has been completed on the public dev@ list.
>> 
>> Starting an implementation on a branch is a valuable 
>> contribution to a
>> design discussion -- it's exactly the kind of 
>> "non-rubber-stamp"
>> contribution one would want.
>> 
>
>You're just repeating what you said above.
>
>> If you want to re-iterate points you've made that have been 
>> left unanswered,
>> that would be a useful contribution -- perhaps some of those 
>> points will be
>> updated now that there's actual code, or perhaps they won't. 
>> Either way,
>> what Evgeny is doing here seems very constructive to me, and 
>> entirely within
>> the normal range of how we do things.
>
>Posting a paragraph such as the one I'm replying to is not 
>"entirely
>within the normal range of how we do things".  As to my points, 
>see
><https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E>.
>They boil down to this:
>
>    <alice> We should migrate away from SHA-1.
>    <bob> Why?
>
>Daniel
>
>> Best regards,
>> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Fri, Jan 20, 2023 at 11:09:11 -0600:
> On 20 Jan 2023, Daniel Shahaf wrote:
> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> > > I can complete the work on this branch and bring it to a
> > > production-ready
> > > state, assuming there are no objections.
> > 
> > Your assumption is counterfactual:
> > 
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> > 
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> > 
> > Objections have been raised, been left unanswered, and now
> > implementation work has commenced following the original design. That's
> > not acceptable.
> 
> I'm a little surprised by your reaction.
> 
> It is never "not acceptable" for someone to do implementation work on a
> branch while a discussion is happening, even if that discussion contains
> objections to or questions about the premise of the branch work.
> 
> It's a branch.  He didn't merge it to trunk, and he posted it as an explicit
> invitation for discussion.
> 

I didn't object to the use of a branch /per se/.  I objected to the
treating of objections that *had already been posted* as though they had
never been posted.  *That's* not acceptable.

However, since you ask, I don't think implementing a proposal on
a branch is necessarily a good idea:

- If the branch is seen and presented as a PoC for furthering discussion
  and for discovering practical considerations (e.g., that
  PRISTINE.MD5_CHECKSUM docstring I found yesterday during discussion,
  or the ra_serf sha1 optimization that anyone implementing the branch
  would run into), it's likely a good thing.
  
- On the other hand, when the branch implements the original proposal,
  whilst outstanding questions were not only not answered but also not
  acknowledged, that's quite another thing.  It can result in:

  + The branch maintainer being biased in favour of the approach they
    have implemented.  (People tend not to argue against what they have
    expended resources on.  Cf. plan continuation bias, sunk cost
    fallacy.)

  + dev@ being biased towards the approach that has been implemented
    (because it's a known entity; because no one is volunteering to
    implement another approach; because there's a desire to cut
    a minor release soon…).  This, in turn, can result in…
  
  + …an incentive for participants *not* to hold open design
    discussions on dev@ in the first place.

> > I'm vetoing the change until a non-rubber-stamp design
> > discussion has been completed on the public dev@ list.
> 
> Starting an implementation on a branch is a valuable contribution to a
> design discussion -- it's exactly the kind of "non-rubber-stamp"
> contribution one would want.
> 

You're just repeating what you said above.

> If you want to re-iterate points you've made that have been left unanswered,
> that would be a useful contribution -- perhaps some of those points will be
> updated now that there's actual code, or perhaps they won't.  Either way,
> what Evgeny is doing here seems very constructive to me, and entirely within
> the normal range of how we do things.

Posting a paragraph such as the one I'm replying to is not "entirely
within the normal range of how we do things".  As to my points, see
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E>.
They boil down to this:

    <alice> We should migrate away from SHA-1.
    <bob> Why?

Daniel

> Best regards,
> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> > That could happen after a public disclosure of a pair of executable
> > files/scripts where the forged version allows for remote code execution.
> > Or maybe something similar with a file format that is often stored in
> > repositories and that can be executed or used by a build script, etc.
> >
>
> Err, hang on.  Your reference described a chosen-prefix attack, while
> this scenario concerns a single public collision.  These are two
> different things.

A chosen-prefix attack allows finding more meaningful collisions such as
working executables/scripts.  When such collisions are made public, they
would have a greater exploitation potential than just a random collision.

> Disclosure of a pair of executable files/scripts isn't by itself
> a problem unless one of the pair ("file A") is in a repository
> somewhere.  Now, was the colliding file ("file B") generated _before_ or
> _after_ file A was committed?
>
> - If _before_, then it would seem Mallory had somehow managed to:
>
>   1. get a file of his choosing committed to Alice's repository; and
>
>   2. get a wc of Alice's repository into one of the codepaths that
>      assume SHA-1 is one-to-one / collision-free (currently that's the
>      ra_serf optimization and the 1.15 wc status).

Not only that.  There are cases when the working copy itself installs the
working file with a hash lookup in the pristine store.  This is more true for
1.14 than trunk, because in trunk we have the streamy checkout/update that
avoids such lookups by writing straight to the working file.  However, some of
the code paths still install the contents from the pristine store by hash.
Examples include reverting a file, copying an unmodified file, switching
a file with keywords, the mentioned ra_serf optimization, and so on.

>   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
>   me how file B would reach the wc in step #2…

If Mallory has write access, she could commit both files, thus arranging for
a possible content change if both files are checked out to a single working
copy.  This isn't the same as just directly modifying the target file, because
file content isn't expected to change due to changes in other files (that can
be of any type), so this attack has much better chances of being unnoticed.

If Mallory doesn't have write access, there should be other vectors, such
as distributing a pair of files (harmless in the context of their respective
file formats) separately via two upstream channels.  Then, if both of the
upstream distributions are committed into a repository and their files are
checked out together, the content will change, allowing for a malicious
action.

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 31 Jan 2023, Daniel Shahaf wrote:
>Karl Fogel wrote on Mon, 30 Jan 2023 23:26 +00:00:
>> Daniel, given what's in Evgeny's branch now, could you summarize
>> your current technical objections if any?
>
>Certainly, but I won't have time to do so today.

Oh, my gosh, I'd be the last person to ever complain about someone 
not being prompt in sending a detailed technical reply here :-). 
It takes me *weeks* sometimes.  Whenever you get time is good.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, 30 Jan 2023 23:26 +00:00:
> Daniel, given what's in Evgeny's branch now, could you summarize 
> your current technical objections if any?

Certainly, but I won't have time to do so today.

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 29 Jan 2023, Evgeny Kotkov via dev wrote:
>I have *absolutely* no idea where "being railroaded through" comes from.
>Really, it's a wrong way of portraying and thinking about the events that have
>happened so far.
>
>Reiterating over those events: I wrote an email containing my thoughts
>and explaining the motivation for such change.  I didn't reply to some of
>the questions (including some tricky questions, such as the one featuring
>a theoretical hash function), because they have been at least partly
>answered by others in the thread, and I didn't have anything valuable
>to add at that time.
>
>During that time, I was actively coding the core part of the change,
>to check if it's possible technically.  Which is important, as far as
>I believe, because not all theoretically possible solutions can be implemented
>without facing significant practical or implementation-related issues, and
>it seems to me that you significantly undervalue such an approach.
>
>I do not say my actions were exemplary, but as far as I can tell, they're
>pretty much in line with how svn-dev has been operating so far.  But, it all
>resulted in an unclear veto without any _technical_ arguments, where what's
>being vetoed is unclear as well, because the change was not ready at the
>moment the veto was cast.
>
>And because your veto goes in favor of a specific process (considering that
>no other arguments were given), the only thing that's *actually* being
>railroaded is an odd form of an RTC (review-then-commit) process that is
>against our usual CTR (commit-then-review) [1,2].  That's railroading,
>because it hasn't been explicitly discussed anywhere and a consensus
>on it has not been reached.

Daniel, given what's in Evgeny's branch now, could you summarize 
your current technical objections if any?

If they are something like "This code is solving the wrong 
problem(s)" or "I'm not sure what problem(s) it's supposed to 
solve", those count as technical objections.  It's just that it 
would be useful to have the objection(s) gathered in one place. 
This thread has been long and somewhat digressive -- I'm not 
saying that's due to you -- and I at least have found it a bit 
difficult to keep track of the concrete objections versus various 
interesting but ultimately theoretical points.

The reason I'm supportive of Evgeny's direction is that his 
changes, if completed, would offer a solution to the (admittedly 
still somewhat distant) security concern I raised early on. 
Essentially, I'm worried that second-preimage attacks on SHA-1 are 
coming eventually (maybe I'm wrong about this -- they are after 
all significantly harder than mere collision attacks).  *If* such 
attacks become possible, then our WC could report a file as 
unmodified when in fact it is modified, which would have real 
security implications, as I outlined.
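
To make the "reported as unmodified" concern concrete, here is a minimal
Python sketch.  It is an illustration only, not Subversion's code: the
function names are hypothetical, and the "toy" checksum is a deliberately
weak stand-in (a byte sum) for which a second preimage is trivial to
construct, since no such attack on SHA-1 is known today.

```python
import hashlib


def toy_checksum(data: bytes) -> str:
    """Deliberately weak stand-in: byte sum mod 2**16 (NOT a real hash)."""
    return format(sum(data) % 65536, "04x")


def sha1_checksum(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def looks_unmodified(working_content: bytes, recorded_checksum: str, checksum) -> bool:
    """Hypothetical status check: trust the checksum instead of comparing bytes."""
    return checksum(working_content) == recorded_checksum


committed = b"allow=0 deny=1\n"
# Same bytes in a different order, so the byte sum is identical.
tampered = b"allow=1 deny=0\n"

for checksum in (toy_checksum, sha1_checksum):
    recorded = checksum(committed)
    print(checksum.__name__, looks_unmodified(tampered, recorded, checksum))
```

With the toy checksum the tampered file passes as unmodified; with SHA-1
it does not, because these two inputs do not collide under SHA-1.  The
worry is only about what happens if an attacker can one day produce
a second preimage for the checksum actually in use.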

Like I said, this is far from urgent, and IMHO it certainly should 
not delay a release of our new pristineless feature.  But when and 
if Evgeny's branch is ready (where "ready" presumably includes 
something other than salted SHA-1 as the other checksum option), I 
would like to see these changes go in, unless we identify some 
harm from them.

For everyone's ease of reference:

$ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/BRANCH-README

$ svn log --stop-on-copy https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> > (I'm not saying that the above rules have to be used in this particular case
> >  and that a veto is invalid, but still thought it’s worth mentioning.)
> >
>
> I vetoed the change because it hadn't been designed on the dev@ list,
> had not garnered dev@'s consensus, and was being railroaded through.
> (as far as I could tell)

I have *absolutely* no idea where "being railroaded through" comes from.
Really, it's a wrong way of portraying and thinking about the events that have
happened so far.

Reiterating over those events: I wrote an email containing my thoughts
and explaining the motivation for such change.  I didn't reply to some of
the questions (including some tricky questions, such as the one featuring
a theoretical hash function), because they have been at least partly
answered by others in the thread, and I didn't have anything valuable
to add at that time.

During that time, I was actively coding the core part of the change,
to check if it's possible technically.  Which is important, as far as
I believe, because not all theoretically possible solutions can be implemented
without facing significant practical or implementation-related issues, and
it seems to me that you significantly undervalue such an approach.

I do not say my actions were exemplary, but as far as I can tell, they're
pretty much in line with how svn-dev has been operating so far.  But, it all
resulted in an unclear veto without any _technical_ arguments, where what's
being vetoed is unclear as well, because the change was not ready at the
moment the veto was cast.

And because your veto goes in favor of a specific process (considering that
no other arguments were given), the only thing that's *actually* being
railroaded is an odd form of an RTC (review-then-commit) process that is
against our usual CTR (commit-then-review) [1,2].  That's railroading,
because it hasn't been explicitly discussed anywhere and a consensus
on it has not been reached.

[1] https://www.apache.org/foundation/glossary.html#CommitThenReview
[2] https://www.apache.org/foundation/glossary.html#ReviewThenCommit

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Mon, Jan 23, 2023 at 02:28:50 +0300:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > > I can complete the work on this branch and bring it to a production-ready
> > > state, assuming there are no objections.
> >
> > Your assumption is counterfactual:
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> 
> I don't see any explicit objections in these two emails (here I assume that
> if something is not clear to a PMC member, it doesn't automatically become
> an objection).  If the "why?" question is indeed an objection, then I would
> say it has already been discussed and responded to in the thread.
> 

The "Why?" was sent _after_ the post you're quoting, and in any case was
just an elevator pitch summary of something I had explained more verbosely.

The first post in this thread asserts X is a problem and Y is a solution
to it, and argues that Y is a good thing.  However, that post does not
explain /why/ X is a problem, does not consider alternatives to Y, and
does not consider possible cons of Y.  That's what's missing.

> Now, returning to the problem:
> 
> As described in the advisory [1], we have a supported configuration that
> makes data forgery possible:
> 
> - A repository with disabled rep-sharing allows storing different files with
>   colliding SHA-1 values.
> - Having a repository with disabled rep-sharing is a supported configuration.
>   There may be a certain number of such repositories in the wild
>   (for example, created with SVN < 1.6 and not upgraded afterwards).
> - A working copy uses an assumption that the pristine contents are equal if
>   their SHA-1 hashes are equal.
> - So committing different files with colliding SHA-1 values makes it possible
>   to forge the contents of a file that will be checked-out and used by the
>   client.
> 
> I would say that this state is worrying just by itself.
> 

I assume this situation could happen accidentally, say, if someone adds
shattered-1.pdf and shattered-2.pdf to the same wc in a particular way.
That is, I'm not assuming "forgery" (which implies Mallory is involved).

Still, this is a potential data integrity issue with the new-in-1.15 wc
format, so we should address it before the release.  What are our
options to address that?  Switching to another checksum is an option,
yes, but we [as in, dev@] don't seem to have considered any alternatives
to that.

Just off the top of my head, we could:

- Encourage or require use of rep-sharing
  [the advisory already recommends this]

- Encourage or require use of tools/hook-scripts/reject-detected-sha1-collisions.sh
  [the advisory already recommends this]

- Have f32 wc's refuse to talk to servers that don't detect SHA-1
  collisions.  (1.15 users will still be able to interoperate with old
  servers by using f31.)

And there may be more options.  (Lurkers are invited to speak up!)

> However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
> probably only a matter of time until the situation becomes worse.
> 

Quoting the third hunk of 
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C20221220201300.GH32332%40tarpaulin.shahaf.local2%3E>:

    What's the acceptance test we use for candidate checksum algorithms?

    You say we should switch to a checksum algorithm that doesn't have known
    collisions, but, why should we require that?  Consider the following
    160-bit checksum algorithm:
    .
        1. If the input consists of 40 ASCII lowercase hex digits and
           nothing else, return the input.
        2. Else, return the SHA-1 of the input.

    This algorithm has a trivial first preimage attack.  If a wc used this
    identity-then-sha1 algorithm instead of SHA-1, then… what?
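
    For concreteness, the quoted algorithm could be sketched in Python as
    below.  The function name is hypothetical; the two branches follow
    steps 1 and 2 above.  The last lines show the trivial first preimage:
    for any digest the function can output, the 40-character digest
    string is itself an input that maps to it.

```python
import hashlib
import string

HEX_DIGITS = set(string.digits + "abcdef")


def identity_then_sha1(data: bytes) -> str:
    """160-bit 'checksum' from the thought experiment (hypothetical name)."""
    # Step 1: exactly 40 lowercase ASCII hex digits -> return the input itself.
    if len(data) == 40 and all(chr(b) in HEX_DIGITS for b in data):
        return data.decode("ascii")
    # Step 2: anything else -> ordinary SHA-1.
    return hashlib.sha1(data).hexdigest()


digest = identity_then_sha1(b"any ordinary file contents\n")
preimage = digest.encode("ascii")  # found without any computation
print(identity_then_sha1(preimage) == digest)
```

    So the ordinary file and the 40-byte file containing its digest
    collide by construction, which is what makes the "then… what?"
    question a useful test of what a wc actually requires of its
    checksum.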

> That could happen after a public disclosure of a pair of executable
> files/scripts where the forged version allows for remote code execution.
> Or maybe something similar with a file format that is often stored in
> repositories and that can be executed or used by a build script, etc.
> 

Err, hang on.  Your reference described a chosen-prefix attack, while
this scenario concerns a single public collision.  These are two
different things.

Disclosure of a pair of executable files/scripts isn't by itself
a problem unless one of the pair ("file A") is in a repository
somewhere.  Now, was the colliding file ("file B") generated _before_ or
_after_ file A was committed?

- If _before_, then it would seem Mallory had somehow managed to:

  1. get a file of his choosing committed to Alice's repository; and

  2. get a wc of Alice's repository into one of the codepaths that
     assume SHA-1 is one-to-one / collision-free (currently that's the
     ra_serf optimization and the 1.15 wc status).

  Now, step #1 seems plausible enough.  As to step #2, it's not clear to
  me how file B would reach the wc in step #2… but insofar as security
  assumptions go, it seems reasonable to assume Mallory can make this
  happen.

  So, I agree it's a scenario we should address.  What options do we
  have to address it?  (I grant that migrating away from SHA-1 is one
  option.)

- If _after_, then you're presuming not simply a collision attack but
  a second preimage attack.  Should we assume Mallory to be able to
  mount a second preimage attack?

Chosen-prefix collision attacks can help Mallory in a variant of the
"before" case: Mallory computes a collision, sends file A to Alice (who
commits it), and invokes his assumed ability to inject file B into
Alice's wc.  This would work for file formats that ignore the unchosen
suffix.

> [1] https://subversion.apache.org/security/sha1-advisory.txt
> [2] https://sha-mbles.github.io/
> 
> 
> Speaking of the proposed switch to SHA-256 or a different checksum, there's
> an argument by contradiction: if we were designing the pristineless working
> copy from scratch today, would we choose SHA-1 as the best available hash
> that can be used to assert content equality?

If we were designing f32 from the ground up, I hope we'd first nail down
our requirements and then check what are the possible ways to address
them.

We might specify that "The probability of birthday collisions in
<USE-CASE> must not exceed <PERCENTAGE>.".

[E.g., my parents named each of their kids for the CRC32 of the
timestamp,longitude,latitude,altitude of that kid's birth, and that was
fine: they had no collisions.  In contrast, sea turtles shouldn't use
CRC32 to name their kids, since they'd have a ≈5% chance of a collision
due to their larger number of offspring.  However, sea turtles would
have no collisions if they used MD5.]
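
A requirement of this shape can be checked with the usual birthday
approximation, p ≈ 1 - exp(-n(n-1)/(2·2^bits)).  The Python sketch below
is illustrative only: the function name is hypothetical, and the figure
of 21,000 offspring is simply the n for which a 32-bit checksum comes
out near the ≈5% quoted above, not a biological claim.

```python
import math


def birthday_collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that n_items uniformly random hash_bits-bit
    values contain at least one collision."""
    space = 2.0 ** hash_bits
    # 1 - exp(-x), written with expm1 so tiny probabilities keep precision.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


for label, n, bits in [
    ("5 items, 32-bit (CRC32-sized)", 5, 32),
    ("21000 items, 32-bit (CRC32-sized)", 21000, 32),
    ("21000 items, 128-bit (MD5-sized)", 21000, 128),
]:
    print(f"{label}: {birthday_collision_probability(n, bits):.3g}")
```

This only models accidental collisions between uniformly distributed
values; it says nothing about an attacker who constructs colliding
inputs on purpose, which is a separate requirement.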

We might specify that "The hash of an <N>-byte file can be computed
within <T> milliseconds on <SUCH AND SUCH> hardware."

[E.g., the existence of https://en.wikipedia.org/wiki/Intel_SHA_extensions
is a consideration.]

We might specify that "An attacker who is capable of <SUCH AND SUCH>
will not be able to cause a false positive or a false negative in the wc
status optimization.".

[E.g., see above about second preimage attacks.]

And then we'd brainstorm possible solutions (plural) and run each of
them through the specifications, which would be our acceptance test
checklist.

(And since we aren't designing from scratch, our actual acceptance test
would also include implementation and maintenance costs for us and
upgrade costs for our users.)

> If yes, how can one prove that?

Well, for starters, rep-sharing was released in 2009, the first
public collision (shattered) was published in 2017, a chosen-prefix
attack (shambled) in 2020, and we haven't had any complaints since then
<fine print>other than from people literally trying to store
shattered-1.pdf and shattered-2.pdf in their repos</fine print>?

And so long as we're doing thought experiments, here's another: If we
switched to using only MD5 internally, would anyone notice?  (Cf. above
about identity-then-sha1, which is even weaker than MD5.)

> > Objections have been raised, been left unanswered, and now implementation
> > work has commenced following the original design.  That's not acceptable.
> > I'm vetoing the change until a non-rubber-stamp design discussion has
> > been completed on the public dev@ list.
> 
> I would like to note that vetoing a code modification should be accompanied
> with a technical justification, and I have certain doubts that the above
> arguments qualify as such:
> 
> https://www.apache.org/foundation/voting.html
> [[[
> To prevent vetoes from being used capriciously, the voter must provide
> with the veto a technical justification showing why the change is bad
> (opens a security exposure, negatively affects performance, etc. ).
> A veto without a justification is invalid and has no weight.
> ]]]
> 
> (I'm not saying that the above rules have to be used in this particular case
>  and that a veto is invalid, but still thought it’s worth mentioning.)
> 

I vetoed the change because it hadn't been designed on the dev@ list,
had not garnered dev@'s consensus, and was being railroaded through.
(as far as I could tell)

> Anyway, I'll stop working on the branch, because a veto has been cast.

That's your decision.  Implementing one design on a branch while other
options are being considered by dev@ /is/ possible, but there are some
risks with that; cf. my remarks in
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121092231.GA3174%40tarpaulin.shahaf.local2%3E>.

And once again, for clarity: I'm not vetoing migrating away from SHA-1.
(In fact, my intuition was that it'd be a good idea.)

Daniel

> 
> Regards,
> Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> > I can complete the work on this branch and bring it to a production-ready
> > state, assuming there are no objections.
>
> Your assumption is counterfactual:
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E

I don't see any explicit objections in these two emails (here I assume that
if something is not clear to a PMC member, it doesn't automatically become
an objection).  If the "why?" question is indeed an objection, then I would
say it has already been discussed and responded to in the thread.

Now, returning to the problem:

As described in the advisory [1], we have a supported configuration that
makes data forgery possible:

- A repository with disabled rep-sharing allows storing different files with
  colliding SHA-1 values.
- Having a repository with disabled rep-sharing is a supported configuration.
  There may be a certain number of such repositories in the wild
  (for example, created with SVN < 1.6 and not upgraded afterwards).
- A working copy uses an assumption that the pristine contents are equal if
  their SHA-1 hashes are equal.
- So committing different files with colliding SHA-1 values makes it possible
  to forge the contents of a file that will be checked-out and used by the
  client.

I would say that this state is worrying just by itself.
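
The mechanism in the list above can be modelled with a small Python
sketch.  This is not Subversion's implementation: `PristineStore`,
`checkout` and the file names are hypothetical, and a deliberately weak
byte-sum checksum stands in for "a checksum for which two different files
collide", since constructing a real SHA-1 collision is out of scope here.

```python
import hashlib


def toy_checksum(data: bytes) -> str:
    """Deliberately weak stand-in: byte sum mod 2**16 (NOT a real hash)."""
    return format(sum(data) % 65536, "04x")


class PristineStore:
    """Toy content-addressed store: equal checksum is taken as equal content."""

    def __init__(self, checksum):
        self._checksum = checksum
        self._blobs = {}

    def install(self, content: bytes) -> str:
        key = self._checksum(content)
        # An existing key means "we already have this content", so the
        # new bytes are silently dropped.
        self._blobs.setdefault(key, content)
        return key

    def read(self, key: str) -> bytes:
        return self._blobs[key]


def checkout(files: dict, checksum) -> dict:
    """Install every file, then restore each one from the store by key."""
    store = PristineStore(checksum)
    keys = {path: store.install(content) for path, content in files.items()}
    return {path: store.read(key) for path, key in keys.items()}


repo = {
    "build.sh": b"verify=1 skip=0\n",
    # Same bytes in a different order: collides under the toy checksum.
    "docs/sample.txt": b"verify=0 skip=1\n",
}
weak = checkout(repo, toy_checksum)
strong = checkout(repo, lambda d: hashlib.sha1(d).hexdigest())
print(weak["docs/sample.txt"] == repo["build.sh"], strong == repo)
```

With the colliding checksum, the second file comes back with the first
file's contents even though the repository stores both correctly; with
a checksum under which the two files differ, both are restored intact.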

However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
probably only a matter of time until the situation becomes worse.

That could happen after a public disclosure of a pair of executable
files/scripts where the forged version allows for remote code execution.
Or maybe something similar with a file format that is often stored in
repositories and that can be executed or used by a build script, etc.

[1] https://subversion.apache.org/security/sha1-advisory.txt
[2] https://sha-mbles.github.io/

Speaking of the proposed switch to SHA-256 or a different checksum, there's
an argument by contradiction: if we were designing the pristineless working
copy from scratch today, would we choose SHA-1 as the best available hash
that can be used to assert content equality?  If yes, how can one prove that?

> Objections have been raised, been left unanswered, and now implementation
> work has commenced following the original design.  That's not acceptable.
> I'm vetoing the change until a non-rubber-stamp design discussion has
> been completed on the public dev@ list.

I would like to note that vetoing a code modification should be accompanied
with a technical justification, and I have certain doubts that the above
arguments qualify as such:

https://www.apache.org/foundation/voting.html
[[[
To prevent vetoes from being used capriciously, the voter must provide
with the veto a technical justification showing why the change is bad
(opens a security exposure, negatively affects performance, etc. ).
A veto without a justification is invalid and has no weight.
]]]

(I'm not saying that the above rules have to be used in this particular case
 and that a veto is invalid, but still thought it’s worth mentioning.)

Anyway, I'll stop working on the branch, because a veto has been cast.

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Jan 2023, Daniel Shahaf wrote:
>Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>> I can complete the work on this branch and bring it to a production-ready
>> state, assuming there are no objections.
>
>Your assumption is counterfactual:
>
>https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
>https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>
>Objections have been raised, been left unanswered, and now
>implementation work has commenced following the original design.  That's
>not acceptable.

I'm a little surprised by your reaction.

It is never "not acceptable" for someone to do implementation work 
on a branch while a discussion is happening, even if that 
discussion contains objections to or questions about the premise 
of the branch work.

It's a branch.  He didn't merge it to trunk, and he posted it as 
an explicit invitation for discussion.

>I'm vetoing the change until a non-rubber-stamp design
>discussion has been completed on the public dev@ list.

Starting an implementation on a branch is a valuable contribution 
to a design discussion -- it's exactly the kind of 
"non-rubber-stamp" contribution one would want.

If you want to re-iterate points you've made that have been left 
unanswered, that would be a useful contribution -- perhaps some of 
those points will be updated now that there's actual code, or 
perhaps they won't.  Either way, what Evgeny is doing here seems 
very constructive to me, and entirely within the normal range of 
how we do things.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Fri, Jan 20, 2023 at 9:51 AM Nathan Hartman <ha...@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 7:18 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >
> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> > > I can complete the work on this branch and bring it to a production-ready
> > > state, assuming there are no objections.
> >
> > Your assumption is counterfactual:
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> >
> > Objections have been raised, been left unanswered, and now
> > implementation work has commenced following the original design.  That's
> > not acceptable.  I'm vetoing the change until a non-rubber-stamp design
> > discussion has been completed on the public dev@ list.
>
>
> I think we can start by discussing some of the pros and cons.
>
> There are two separate things here but they end up being mixed
> together in the discussions:
>
> 1. Pros/cons of switching from SHA1 to another hash.
> 2. Supporting different hash types in f32.
>
> Regarding the first item:
>
> Do we need to switch from SHA1 to another hash? One con that was
> already mentioned [1] is that we'll never really be able to switch
> away from SHA1, as there are existing clients, servers, and working
> copies out there. Not only will we have to support SHA1 forever for
> backwards compatibility, but any new hash that is ever added will need
> to be supported forever as well. If we accumulate many of those, it
> might become a burden, but perhaps there will be only one new hash and
> it will be the "blessed" one for the next 20 years.
>
> There were concerns about collisions; since the space of possible
> input datasets is infinite and the hash code size is fixed and finite
> (pretty large, but very much finite), there will always be collisions
> with any hash. The significant questions are: how small is the
> probability of a collision, and (for the purposes of security) how
> hard is it to generate input data that produces a collision? The
> answer to the first question is fixed; the second one is probably
> expected to change over time, as algorithms are studied and new
> vulnerabilities are found. Which hash type do you pick, and who knows
> if a hash thought to be very strong (today) later proves easier to
> crack than one that is thought not as strong? We can only guess.
>
> Taking a step back, this discussion started because pristine-free WCs
> are IIUC more dependent on comparing hashes than pristineful WCs, and
> therefore a hash collision could have more impact in a pristine-free
> WC. "Guarantees" were mentioned, but I think it's important to state
> that there's only a guarantee of probability, since as mentioned above
> all hashes will have collisions.
>
> We already can't store files with identical SHA1 hashes, but AFAIK the
> only meaningful impact we've ever heard is that security researchers
> cannot track files they generate with deliberate collisions. The same
> would be true with any hash type, for collisions within that hash
> type.
>
> Advantages of switching to a new hash type might include: reducing the
> already small probability of collisions; choosing an algorithm that is
> faster or that has (or is expected to have in the future) hardware
> acceleration on commodity systems; and perhaps addressing user
> perception (if SHA1 is seen as old and uncool). But then again, we
> can't really get rid of SHA1...
>
> [1] https://lists.apache.org/thread/v3dv1dtod2t9yrf920h4838g2t0l94cw
>
> Regarding the second item:
>
> Since the premise of this feature is to support adding new hash types
> without bumping wc formats, it follows that any new hash type will
> create compatibility problems for clients that support f32 but not the
> specific new hash type. In light of that, it might just be better to
> bump the wc format and then you know at the outset that you need to
> upgrade your client. Just thinking out loud here but this might be
> (partly) mitigated by trying to guess which hash types we might want
> in the future and supporting them now, even if no existing client will
> actually use them, but I don't really like this idea.
>
> I'll have to return later with more thoughts...

Just quickly I want to say that although I mentioned mostly cons
above, I don't want to appear to be against switching hashes nor
against supporting multiple hash types in f32; rather, since the
i525-pod feature necessitated a format bump anyway, I do think it
makes sense to consider adding such changes now, to avoid a future
format bump, and I'm considering arguments contrary to that from a
desire to be unbiased about it.

I have more thoughts (including more pros) but have some things to
attend to now.

Looking forward to hearing others' thoughts as well.

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Nathan Hartman wrote on Fri, 20 Jan 2023 14:51 +00:00:
> 1. Pros/cons of switching from SHA1 to another hash.
⋮
> Do we need to switch from SHA1 to another hash? One con that was
> already mentioned [1] is that we'll never really be able to switch
> away from SHA1, as there are existing clients, servers, and working
> copies out there. Not only will we have to support SHA1 forever for
> backwards compatibility,

Actually, I think it's MD5, not SHA-1, that we have to support
indefinitely, since our uses of SHA-1 fall into two categories:

- Accompanied by MD5.  (wc.db PRISTINE table, FSFS node-rev headers,
  dumpfiles' Text-content-* headers)

- An optional optimization.  (ra_serf, rep-cache.db)

>                          but any new hash that is ever added will need
> to be supported forever as well. If we accumulate many of those, it
> might become a burden,

Good point.  Then perhaps we should continue to record two checksums, as
both wc.db and FSFS do?  If we record, say, both «(svn_checksum_kind_t)42»
checksums and «(svn_checksum_kind_t)value_of_the_month» checksums, then
we'll only need to be able to upgrade from the former.

>                        but perhaps there will be only one new hash and
> it will be the "blessed" one for the next 20 years.

Cheers,

Daniel

P.S.  wc-metadata.sql implies that having MD5 collisions in a wc is supported:

    /* wc-metadata.sql -- schema used in the wc-metadata SQLite database
     *     This is intended for use with SQLite 3
    ⋮
    CREATE TABLE PRISTINE (
      /* The SHA-1 checksum of the pristine text. This is a unique key. The
         SHA-1 checksum of a pristine text is assumed to be unique among all
         pristine texts referenced from this database. */
      checksum  TEXT NOT NULL PRIMARY KEY,

    ⋮
      /* Alternative MD5 checksum used for communicating with older
         repositories. Not strictly guaranteed to be unique among table rows. */
      md5_checksum  TEXT NOT NULL
      );

    CREATE INDEX I_PRISTINE_MD5 ON PRISTINE (md5_checksum);

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

[ tl;dr: See last paragraph for a concrete question about ra_serf. ]

Karl Fogel wrote on Fri, 20 Jan 2023 17:18 +00:00:
> Yes.  A hash is considered "broken" the moment security researchers
> can generate a collision.

Consider the following uses of hash functions in our code:

- FSFS rep-cache uses SHA-1.

- The ra_serf download optimization uses SHA-1.

- The commit editor uses MD5 in apply_textdelta() and close_file().

The first one is fine, because FSFS rejects collisions in new commits
(as pointed out upthread).

The second one is not necessarily fine: a variation of the attack you (kfogel)
described could make a client wrongly trigger the optimization and end
up with the wrong fulltext.

The third one is fine, because the delta and its resulting fulltext's
checksum don't travel separately.

So, there you have it: a use of SHA-1 which can stay as-is, a use of SHA-1
which may need attention, and a use of MD5 which can stay as-is — all
in the same codebase.

Thus, whether a hash function is "broken" or not depends on the context
in which it is used.

----

To be clear, the ra_serf thing which "may need attention" is the use
of «final_sha1_checksum» in subversion/libsvn_ra_serf/update.c.  That's
a place where we assume SHA-1 is one-to-one.

Cheers,

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

Replying to multiple parts of this thread...

On Sat, Jan 21, 2023 at 12:58 PM Karl Fogel <kf...@red-bean.com> wrote:
> *nod* This issue isn't important enough to me to continue the
> conversation -- I'd like for new hash algorithms to be possible,
> and I think Evgeny's work on it is worthwhile, but I don't feel
> nearly as strongly about this as I feel about making the new
> pristineless working copies available in an official release as
> soon as we can.

I think it's certainly worthwhile to explore the multi-hash feature,
and if it can be in 1.15, that's good too. But if it will take a while
to *hash* out the details (pun intended) then I'm okay with letting it
wait for a future release in the interest of getting the i525pod
feature out there, even though that means a (possible) future format
bump. (i525pod provides a substantial immediate benefit, while a format
bump isn't necessarily the end of the world).

Having said so, continuing to explore the multi-hash idea:

Previously, I wrote: "Since the premise of this feature is to support
adding new hash types without bumping wc formats, it follows that any
new hash type will create compatibility problems for clients that
support f32 but not the specific new hash type. In light of that, it
might just be better to bump the wc format and then you know at the
outset that you need to upgrade your client. Just thinking out loud
here but this might be (partly) mitigated by trying to guess which hash
types we might want in the future and supporting them now, even if no
existing client will actually use them, but I don't really like this
idea."

I didn't like my own idea at the time, but the following got me
thinking:

On Sun, Jan 22, 2023 at 7:41 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> The server is aware of what algorithm the wc uses on the wire, which is
> SHA-1 in ra_serf's download optimization and MD5 in svn_delta_editor_t::apply_textdelta()
> and svn_delta_editor_t::close_file().  However, the algorithm(s) used by
> the wc for naming pristines and, in f32, for detecting local mods are
> implementation details of the wc.
>
> So, suppose the wc didn't hardcode _any particular_ hash function for
> naming pristines and for status walks — not md5, not sha1, not sha256 —
> but had each «svn checkout» run pick a hash function uniformly at random
> out of a large enough family of hash functions[1].  (Intuitively, think
> of a family of hash functions as a hash function with a random salt,
> similar to [2].)
>
> This way, even if someone tried to deliberately create a collision, they
> wouldn't be able to pick a collision "off the shelf", as with
> shattered.io; they'd need to compute a collision for the specific hash
> function ("salt") used by that particular wc.  That's more difficult than
> creating a collision in a well-known hash function, regardless of
> whether we treat the salt's value as a secret of the wc (as in, stored
> in a mode-0400 file under the .svn directory and not disclosed to the
> server) or as a value the attacker is assumed to know.
>
> So, that's one way to address the scenario kfogel described.

Suppose the wc is made to support multiple hash types, support is added
now for "many" hash types (leaving open the question of "how many" and
which ones for now), and at checkout time, one is chosen, either "at
random" as suggested by danielsh, or, say, by some explicit user
option.

Suppose also that there is a possibility for the user to blacklist some
hash types which the user does not want used at all.

Now, if a specific hash type is later cracked (in the shattered.io
sense), the security fix on SVN's end is to add that hash type to the
default blacklist of hash types. It would still be supported, but new
working copies wouldn't choose it. In the advisory for said fix, we'd
document a workaround for users who can't/won't upgrade: the steps
users can take to blacklist the affected hash types on their systems,
in effect getting the same outcome as upgrading.

One caveat: In either case (whether the user upgrades or applies the
workaround), they'd have to check out new working copies (or maybe run
some invocation of 'svn upgrade') or the existing hashes won't be
changed.
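
A minimal sketch of the selection step described above. Everything named here is hypothetical: the kind list, `SUPPORTED_KINDS`, and `choose_checksum_kind` are illustrations, not existing Subversion API. It only shows that a blacklisted kind stays usable for existing working copies while never being picked for new ones.

```python
import hashlib
import secrets

# Hypothetical set of checksum kinds a client build supports (an assumption).
SUPPORTED_KINDS = ("sha1", "sha256", "sha512", "sha3_256")


def choose_checksum_kind(blacklist=frozenset(), preferred=None):
    """Pick the checksum kind for a newly checked-out working copy.

    Blacklisted kinds are never chosen for new working copies, but they
    remain supported for reading existing ones (see checksum() below).
    """
    allowed = [k for k in SUPPORTED_KINDS if k not in blacklist]
    if not allowed:
        raise ValueError("every supported checksum kind is blacklisted")
    if preferred is not None:
        # Explicit user option: honour it only if it is still allowed.
        if preferred not in allowed:
            raise ValueError(f"checksum kind {preferred!r} is not allowed")
        return preferred
    # Otherwise pick uniformly at random among the allowed kinds.
    return secrets.choice(allowed)


def checksum(kind, data):
    """Hash data with the given kind; works for blacklisted kinds too."""
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported checksum kind {kind!r}")
    return hashlib.new(kind, data).hexdigest()
```

With this shape, the security fix amounts to shipping a larger default `blacklist`, and the documented workaround is for users to extend that set themselves.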

And there's also this:

On Sat, Jan 21, 2023 at 5:25 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> For example, if we used another checksum algorithm, the attacker from
> your scenario might opt to edit the base checksums in .svn/wc.db and
> rename the .svn/pristine/ files accordingly.  That's much easier to pull
> off, and will be easy to adapt if we change the algorithm again, but on
> the other hand, requires write access to the .svn directory and is
> easier to discover.

Yup. Once an attacker has write access to the .svn contents, all bets
are off anyway.

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

[See below a proposal that libsvn_wc not use any fixed hash function.]

Martin Edgar Furter Rathod wrote on Sat, 21 Jan 2023 05:22 +00:00:
> On 20.01.23 22:48, Karl Fogel wrote:
>> On 20 Jan 2023, Nathan Hartman wrote:
>>> We already can't store files with identical SHA1 hashes, but AFAIK the
>>> only meaningful impact we've ever heard is that security researchers
>>> cannot track files they generate with deliberate collisions. The same
>>> would be true with any hash type, for collisions within that hash
>>> type.
>> 
>> Yes.  A hash is considered "broken" the moment security researchers can 
>> generate a collision.
>
> No matter what hash function you choose now, sooner or later it will be 
> broken.
>
> But a broken hash function can still be good enough for use in tools 
> like subversion if it is used correctly. Instead of just storing the 
> hash value subversion should also store a sequence number. Whenever a 
> collision happens subversion has to compare the two (or more) files 
> which have the same hash value.

So, basically, just do what the implementation of hashes (the data
structure mapping keys to values) does?

I think this would work in most of our uses of checksums, and make it
possible to have collisions in both the repository and the wc.

However, what about running `svn status` when there's an unhydrated file
that has been modified in a way that changes the fulltext but doesn't
change the checksum value?  In this case the BASE fulltext isn't
available locally to compare with.
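
A minimal sketch of why this case is hard, assuming a status check that has only the recorded checksum to go on. The function name and signature are hypothetical and are not taken from libsvn_wc.

```python
import hashlib


def looks_modified(working_bytes, recorded_checksum, kind="sha1"):
    """Status check for a file whose BASE fulltext is not stored locally.

    The only evidence available is the recorded checksum, so two different
    fulltexts with the same checksum are indistinguishable here.
    """
    return hashlib.new(kind, working_bytes).hexdigest() != recorded_checksum


base = b"original contents\n"
recorded = hashlib.sha1(base).hexdigest()

unchanged = looks_modified(base, recorded)           # False
edited = looks_modified(b"edited contents\n", recorded)  # True
```

A colliding edit would make `looks_modified` return False, and there is no local fulltext to fall back on for a byte-by-byte comparison.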

----

I think there is actually something we can do about this: stop
hardcoding any particular hash function in libsvn_wc's internals.

The server is aware of what algorithm the wc uses on the wire, which is
SHA-1 in ra_serf's download optimization and MD5 in svn_delta_editor_t::apply_textdelta()
and svn_delta_editor_t::close_file().  However, the algorithm(s) used by
the wc for naming pristines and, in f32, for detecting local mods are
implementation details of the wc.

So, suppose the wc didn't hardcode _any particular_ hash function for
naming pristines and for status walks — not md5, not sha1, not sha256 —
but had each «svn checkout» run pick a hash function uniformly at random
out of a large enough family of hash functions[1].  (Intuitively, think
of a family of hash functions as a hash function with a random salt,
similar to [2].)

This way, even if someone tried to deliberately create a collision, they
wouldn't be able to pick a collision "off the shelf", as with
shattered.io; they'd need to compute a collision for the specific hash
function ("salt") used by that particular wc.  That's more difficult than
creating a collision in a well-known hash function, regardless of
whether we treat the salt's value as a secret of the wc (as in, stored
in a mode-0400 file under the .svn directory and not disclosed to the
server) or as a value the attacker is assumed to know.

So, that's one way to address the scenario kfogel described.

Thanks for speaking up, Martin.

Daniel

[1] I'm not making this term up; see, for instance, page 143 of
    https://cseweb.ucsd.edu/~mihir/papers/gb.pdf.  "𝒦" is keyspace,
    "D" is domain, "R" is range.  A random element K ∈ 𝒦 is chosen and the
    hash function H_K [aka H with currying of the first parameter] is
    used thereafter.

[2]
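    import hashlib, secrets
    def f(foo):
        # The salt selects one member of the family of SHA-1-based functions.
        return hashlib.sha1(str(foo).encode() + f.salt).hexdigest()
    f.salt = secrets.token_bytes(16)  # chosen at random, once per wc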

> If the files are identical the old 
> hash+number pair is stored. If they differ the new file gets a new 
> sequence number and that hash+number pair is stored. Since collisions 
> almost never happen even if md5 is used the performance penalty will be 
> almost zero.
>
> The same thing has been discussed earlier and changing the hash function 
> will just solve the problem for a few years...
>
> Best regards,
> Martin

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Martin Edgar Furter Rathod <mf...@apache.org>.

On 20.01.23 22:48, Karl Fogel wrote:
> On 20 Jan 2023, Nathan Hartman wrote:
>> We already can't store files with identical SHA1 hashes, but AFAIK the
>> only meaningful impact we've ever heard is that security researchers
>> cannot track files they generate with deliberate collisions. The same
>> would be true with any hash type, for collisions within that hash
>> type.
> 
> Yes.  A hash is considered "broken" the moment security researchers can 
> generate a collision.

No matter what hash function you choose now, sooner or later it will be 
broken.

But a broken hash function can still be good enough for use in tools 
like subversion if it is used correctly. Instead of just storing the 
hash value subversion should also store a sequence number. Whenever a 
collision happens subversion has to compare the two (or more) files 
which have the same hash value. If the files are identical the old 
hash+number pair is stored. If they differ the new file gets a new 
sequence number and that hash+number pair is stored. Since collisions 
almost never happen even if md5 is used the performance penalty will be 
almost zero.
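
The scheme can be sketched as a toy in-memory store, assuming both fulltexts are available whenever a hash value repeats. The class and method names are hypothetical; nothing here reflects Subversion's actual pristine store.

```python
import hashlib


class CollisionTolerantStore:
    """Toy content store keyed by (hash value, sequence number)."""

    def __init__(self, hash_func=None):
        # Default to SHA-1; a deliberately weak function can be injected
        # to exercise the collision path.
        self._hash = hash_func or (lambda data: hashlib.sha1(data).hexdigest())
        self._buckets = {}  # hash value -> list of contents; index = sequence number

    def add(self, content):
        """Store content and return its (hash value, sequence number) key."""
        digest = self._hash(content)
        bucket = self._buckets.setdefault(digest, [])
        # A byte comparison is needed only when the hash value is already known.
        for seq, existing in enumerate(bucket):
            if existing == content:
                return digest, seq  # identical file: reuse the old pair
        bucket.append(content)      # different file: assign a new number
        return digest, len(bucket) - 1

    def get(self, key):
        digest, seq = key
        return self._buckets[digest][seq]


# A hash that collides on purpose: every 3-byte input gets the value "3".
store = CollisionTolerantStore(hash_func=lambda data: str(len(data)))
key_foo = store.add(b"foo")
key_bar = store.add(b"bar")
key_foo_again = store.add(b"foo")
```

The full comparison runs only on a repeated hash value, which is why the cost is close to zero when collisions are rare.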

The same thing has been discussed earlier and changing the hash function 
will just solve the problem for a few years...

Best regards,
Martin

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Jan 2023, Nathan Hartman wrote:
>Taking a step back, this discussion started because pristine-free WCs
>are IIUC more dependent on comparing hashes than pristineful WCs, and
>therefore a hash collision could have more impact in a pristine-free
>WC. "Guarantees" were mentioned, but I think it's important to state
>that there's only a guarantee of probability, since as mentioned above
>all hashes will have collisions.

Sure, in a literal mathematical sense, but not in a sense that 
matters for our purposes here.

In the absence of an intentionally caused collision, a good hash 
function has *far* less chance of accidental collision than, say, 
the chance that your CPU will malfunction due to a stray cosmic 
ray, or the chance of us getting hit by a planet-destroying 
meteorite tomorrow.

For our purposes, "guarantee" is accurate.  No guarantee we make 
can be stronger than the inverse probability of a CPU/memory 
malfunction anyway.
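
A rough way to put a number on the accidental case is the birthday bound, sketched below. It assumes an idealized hash with uniformly distributed output, and the figure of one billion distinct files is an arbitrary example. It says nothing about deliberately constructed collisions.

```python
from fractions import Fraction


def accidental_collision_bound(num_files, hash_bits):
    """Upper bound on the chance that any two of num_files distinct inputs
    collide under an ideal hash with hash_bits bits of output.

    Birthday (union) bound: p <= n * (n - 1) / 2 / 2**bits.
    """
    pairs = Fraction(num_files * (num_files - 1), 2)
    return pairs / Fraction(2) ** hash_bits


# One billion distinct files, an arbitrary illustrative figure.
sha1_bound = float(accidental_collision_bound(10**9, 160))    # SHA-1 has 160 bits
sha256_bound = float(accidental_collision_bound(10**9, 256))  # SHA-256 has 256 bits
```

For a 160-bit hash the bound is on the order of 10^-31, which is the sense in which accidental collisions can be ignored.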

>We already can't store files with identical SHA1 hashes, but AFAIK the
>only meaningful impact we've ever heard is that security researchers
>cannot track files they generate with deliberate collisions. The same
>would be true with any hash type, for collisions within that hash
>type.

Yes.  A hash is considered "broken" the moment security researchers 
can generate a collision.

FWIW, in one of my previous posts, I described a real-life 
scenario in which the ability to generate a chosen-plaintext 
collision in an SVN working copy would have security implications.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Fri, Jan 20, 2023 at 7:18 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>
> Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> > I can complete the work on this branch and bring it to a production-ready
> > state, assuming there are no objections.
>
> Your assumption is counterfactual:
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>
> Objections have been raised, been left unanswered, and now
> implementation work has commenced following the original design.  That's
> not acceptable.  I'm vetoing the change until a non-rubber-stamp design
> discussion has been completed on the public dev@ list.

I think we can start by discussing some of the pros and cons.

There are two separate things here but they end up being mixed
together in the discussions:

1. Pros/cons of switching from SHA1 to another hash.
2. Supporting different hash types in f32.

Regarding the first item:

Do we need to switch from SHA1 to another hash? One con that was
already mentioned [1] is that we'll never really be able to switch
away from SHA1, as there are existing clients, servers, and working
copies out there. Not only will we have to support SHA1 forever for
backwards compatibility, but any new hash that is ever added will need
to be supported forever as well. If we accumulate many of those, it
might become a burden, but perhaps there will be only one new hash and
it will be the "blessed" one for the next 20 years.

There were concerns about collisions; since the space of possible
input datasets is infinite and the hash code size is fixed and finite
(pretty large, but very much finite), there will always be collisions
with any hash. The significant questions are: how small is the
probability of a collision, and (for the purposes of security) how
hard is it to generate input data that produces a collision? The
answer to the first question is fixed; the second one is probably
expected to change over time, as algorithms are studied and new
vulnerabilities are found. Which hash type do you pick, and who knows
if a hash thought to be very strong (today) later proves easier to
crack than one that is thought not as strong? We can only guess.

Taking a step back, this discussion started because pristine-free WCs
are IIUC more dependent on comparing hashes than pristineful WCs, and
therefore a hash collision could have more impact in a pristine-free
WC. "Guarantees" were mentioned, but I think it's important to state
that there's only a guarantee of probability, since as mentioned above
all hashes will have collisions.

We already can't store files with identical SHA1 hashes, but AFAIK the
only meaningful impact we've ever heard is that security researchers
cannot track files they generate with deliberate collisions. The same
would be true with any hash type, for collisions within that hash
type.

Advantages of switching to a new hash type might include: reducing the
already small probability of collisions; choosing an algorithm that is
faster or that has (or is expected to have in the future) hardware
acceleration on commodity systems; and perhaps addressing user
perception (if SHA1 is seen as old and uncool). But then again, we
can't really get rid of SHA1...

[1] https://lists.apache.org/thread/v3dv1dtod2t9yrf920h4838g2t0l94cw

Regarding the second item:

Since the premise of this feature is to support adding new hash types
without bumping wc formats, it follows that any new hash type will
create compatibility problems for clients that support f32 but not the
specific new hash type. In light of that, it might just be better to
bump the wc format and then you know at the outset that you need to
upgrade your client. Just thinking out loud here but this might be
(partly) mitigated by trying to guess which hash types we might want
in the future and supporting them now, even if no existing client will
actually use them, but I don't really like this idea.

I'll have to return later with more thoughts...

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> I can complete the work on this branch and bring it to a production-ready
> state, assuming there are no objections.

Your assumption is counterfactual:

https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E

https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E

Objections have been raised, been left unanswered, and now
implementation work has commenced following the original design.  That's
not acceptable.  I'm vetoing the change until a non-rubber-stamp design
discussion has been completed on the public dev@ list.

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 19 Jan 2023, Evgeny Kotkov wrote:
>To have a more or less accurate estimate, I went ahead and prepared the
>first-cut implementation of an approach that makes the pristine checksum
>kind configurable in a working copy.
>
>The current implementation passes all tests in my environment and seems to
>work in practice.  It is available on the branch:
>
>  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind
>
>The implementation on the branch allows creating working copies that use a
>checksum kind other than SHA-1.
>
>The checksum kind is persisted in the settings table.  Upgraded working copies
>of the older formats will have SHA-1 recorded as their pristine checksum kind
>and will continue to use it for compatibility.  Newly created working copies
>of the latest format (with --compatible-version=1.15 or --store-pristine=no),
>as currently implemented, will use the new pristine checksum kind.
>
>Currently, as a proof-of-concept, the branch uses salted SHA-1 as the new
>pristine checksum kind.  For the production-ready state, I plan to support
>using multiple new checksum types such as SHA-256.  I think that it would
>be useful for future compatibility, because if we encounter any issues with
>one checksum kind, we could then switch to a different kind without having
>to change the working copy format.
>
>One thing worth noting is that ra_serf contains a specific optimization for
>the skelta-style updates that allows skipping a GET request if the pristine
>store already contains an entry with the specified SHA-1 checksum.  Switching
>to a different checksum type for the pristine entries is going to disable
>that specific optimization.  Re-enabling it would require an update of the
>server-side.  I consider this to be out of scope for this branch.
>
>I can complete the work on this branch and bring it to a production-ready
>state, assuming there are no objections.

This sounds great to me; thank you, Evgeny.  I agree that the 
server-side companion change is (or anyway can be) out-of-scope 
here -- the perfect should not be the enemy of the good, etc.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> Now, how hard would this be to actually implement?

To have a more or less accurate estimate, I went ahead and prepared the
first-cut implementation of an approach that makes the pristine checksum
kind configurable in a working copy.

The current implementation passes all tests in my environment and seems to
work in practice.  It is available on the branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind

The implementation on the branch allows creating working copies that use a
checksum kind other than SHA-1.

The checksum kind is persisted in the settings table.  Upgraded working copies
of the older formats will have SHA-1 recorded as their pristine checksum kind
and will continue to use it for compatibility.  Newly created working copies
of the latest format (with --compatible-version=1.15 or --store-pristine=no),
as currently implemented, will use the new pristine checksum kind.
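
The rule in the paragraph above can be sketched as follows. The table layout, column names, and the `sha256` default are assumptions made for illustration; the real settings table schema is not shown in this thread, and the branch currently uses salted SHA-1.

```python
import sqlite3

LATEST_FORMAT = 32            # "f32" in this thread
NEW_DEFAULT_KIND = "sha256"   # assumption, standing in for the new kind


def init_settings(db, wc_format, upgraded_from_older):
    """Record the pristine checksum kind once, at creation or upgrade time."""
    # Hypothetical schema, for illustration only.
    db.execute(
        "CREATE TABLE IF NOT EXISTS settings ("
        " wc_format INTEGER NOT NULL,"
        " pristine_checksum_kind TEXT NOT NULL)"
    )
    if upgraded_from_older or wc_format < LATEST_FORMAT:
        kind = "sha1"             # existing pristines stay valid as-is
    else:
        kind = NEW_DEFAULT_KIND   # fresh working copy of the latest format
    db.execute("INSERT INTO settings VALUES (?, ?)", (wc_format, kind))
    return kind


def pristine_checksum_kind(db):
    """Read back the kind that every later operation on this wc must use."""
    row = db.execute("SELECT pristine_checksum_kind FROM settings").fetchone()
    return row[0]


upgraded_wc = sqlite3.connect(":memory:")
init_settings(upgraded_wc, LATEST_FORMAT, upgraded_from_older=True)

fresh_wc = sqlite3.connect(":memory:")
init_settings(fresh_wc, LATEST_FORMAT, upgraded_from_older=False)
```

Because the kind is read from the working copy rather than compiled in, adding another kind later would not by itself require a new format number.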

Currently, as a proof-of-concept, the branch uses salted SHA-1 as the new
pristine checksum kind.  For the production-ready state, I plan to support
using multiple new checksum types such as SHA-256.  I think that it would
be useful for future compatibility, because if we encounter any issues with
one checksum kind, we could then switch to a different kind without having
to change the working copy format.

One thing worth noting is that ra_serf contains a specific optimization for
the skelta-style updates that allows skipping a GET request if the pristine
store already contains an entry with the specified SHA-1 checksum.  Switching
to a different checksum type for the pristine entries is going to disable
that specific optimization.  Re-enabling it would require an update of the
server-side.  I consider this to be out of scope for this branch.
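
A minimal sketch of that decision, assuming the pristine store can be modelled as a mapping keyed by the working copy's own checksum kind. The function name and parameters are hypothetical and do not mirror the code in libsvn_ra_serf.

```python
import hashlib


def needs_get(pristine_store, server_sha1, wc_checksum_kind):
    """Return True if the fulltext must be fetched with a GET request.

    pristine_store maps checksum -> fulltext, keyed by the wc's checksum kind.
    """
    if wc_checksum_kind != "sha1":
        # The server advertises only SHA-1, so a store keyed by another
        # kind cannot be searched and the shortcut is unavailable.
        return True
    # A SHA-1 match is treated as proof that the content is identical.
    return server_sha1 not in pristine_store


hello_sha1 = hashlib.sha1(b"hello").hexdigest()
sha1_store = {hello_sha1: b"hello"}
sha256_store = {hashlib.sha256(b"hello").hexdigest(): b"hello"}

skip_possible = needs_get(sha1_store, hello_sha1, "sha1")        # False
other_kind = needs_get(sha256_store, hello_sha1, "sha256")       # True
```

Re-enabling the shortcut for other kinds would need the server to send a checksum of the matching kind, which is the server-side update mentioned above.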

I can complete the work on this branch and bring it to a production-ready
state, assuming there are no objections.


Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 29 Dec 2022, Evgeny Kotkov wrote:
>Karl Fogel <kf...@red-bean.com> writes:
>
>> Now, how hard would this be to actually implement?
>
>I plan to take a more detailed look at that, but I'm currently on vacation
>for the New Year holidays.

That's great to hear, Evgeny.  In the meantime, enjoy your 
vacation!

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> Now, how hard would this be to actually implement?

I plan to take a more detailed look at that, but I'm currently on vacation
for the New Year holidays.


Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 28 Dec 2022, Branko Čibej wrote:
>My point was that we shouldn't have to worry about format bumps as
>much any more because we have infrastructure in the client for
>supporting multiple WC formats. That includes optional pristines,
>different hashes, compressed pristines, etc. etc.

Thank you for the reminder -- that is indeed important here.

On 28 Dec 2022, Daniel Sahlberg wrote:
>Since we need to be backwards compatible with older v1 clients, can
>this check ever be removed (before Subversion 2)?
>
>So, while I believe f32 is a good opportunity to switch to a new
>hash, what is the problem we would like to solve with a new hash?

As I said before, even if we couldn't think of a concrete problem 
right now, the mere fact that a former guarantee [1] has become a 
non-guarantee is enough motivation.  We can't anticipate all the 
problems that might arise from people being able to craft local 
content that looks unmodified to Subversion.  (As you implied, 
r1794611 has no effect for content that is never committed to the 
repository.)

Of course, my saying "This matters just through reasoning from 
first principles, therefore we should fix it" would count for a 
lot more if I were volunteering to fix it, which, alas, I am not. 
But I do think we don't need to search further for justifications. 
What we already know is enough: our hash algorithm is known to be 
collidable, yet what we're using it for depends on 
non-collidability; therefore, switching to a better algorithm is a 
good idea.

However, it needn't be a blocker for the next release, for the 
reason Brane gave.

Best regards,
-Karl

[1] "Former guarantee" meaning "former guarantee for all practical 
purposes", of course, since in the past there weren't ways to make 
collisions happen.

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Branko Čibej <br...@apache.org>.

On 28.12.2022 13:34, Daniel Sahlberg wrote:
> Since we need to be backwards compatible with older v1 clients, can 
> this check ever be removed (before Subversion 2)?

The case you're citing is specific to the repository, you could easily 
have a repository format that uses different hashes. The same for the RA 
layer, where we have capability negotiation; likewise for the WC. We'll 
always need compatibility with older formats, but a new enough client 
and server could use, e.g., SHA-256 or -512 all the way from WC to 
repository.

> So, while I believe f32 is a good opportunity to switch to a new hash, 
> what is the problem we would like to solve with a new hash?

On the other hand, there can be no "switching to" a new hash, because 
you don't know what the server actually supports -- hence, we'll always 
have to keep SHA-1 around. :) IMO Karl described one possible attack 
vector, and given the context (Wordpress...) it's probably only a matter 
of time before it happens.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

Den ons 28 dec. 2022 kl 08:48 skrev Branko Čibej <br...@apache.org>:

> On 27.12.2022 02:56, Karl Fogel wrote:
>
> Now, how hard would this be to actually implement?  The
> pristineless-format WC upgrade is an opportunity to make other format
> changes, but I'd hate to block the release of pristineless working copies
> on this...
>
>
> My point was that we shouldn't have to worry about format bumps as much
> any more because we have infrastructure in the client for supporting
> multiple WC formats. That includes optional pristines, different hashes,
> compressed pristines, etc. etc.
>

Evgeny has a point that when going from 31 to 32, we know that all
pristines are there and we can rehash them in place. If/when we create
format X with the new XYZ-hash, we either have to download all missing
pristines or we have to support multiple hashes for each file.
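To illustrate the trade-off, here is a rough sketch that models the pristine store as a plain dictionary from old SHA-1 digest to file content, with None standing for a pristine that is not stored locally. The data model and the function name are assumptions made for illustration only; the real store lives in wc.db and on disk.

```python
import hashlib


def rehash_pristines(pristines, new_kind="sha256"):
    """Rehash locally available pristines under a new checksum kind.

    pristines maps an old digest to the content bytes, or to None when the
    pristine is absent. Returns (old -> new digest mapping, list of old
    digests that could not be rehashed without fetching the content).
    """
    mapping = {}
    missing = []
    for old_digest, content in pristines.items():
        if content is None:
            # No local content: either fetch it or keep the old hash around.
            missing.append(old_digest)
        else:
            mapping[old_digest] = hashlib.new(new_kind, content).hexdigest()
    return mapping, missing


store = {
    hashlib.sha1(b"hello").hexdigest(): b"hello",
    "0000000000000000000000000000000000000000": None,  # dehydrated entry
}
mapping, missing = rehash_pristines(store)
print(len(mapping), "rehashed in place;", len(missing), "need a fetch")
```

When every pristine is present, the missing list is empty and the upgrade needs no network access, which is the situation described for the 31 to 32 step.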

I've been thinking about this question and, while I don't know all the
background, there seem to be two different questions:
- Detecting changes in the WC. Karl has an excellent scenario where this
might be a problem, but switching to a new hash only makes this scenario
more expensive. Thus: What is the definition of "expensive enough"? I
believe this is a different way of asking the same question posed by
DanielSh about the criteria for a new hash.
- Storing files with hash collisions. Subversion prevents this (with
E160067) and as far as I understand this is because of r1794611 (by Stefan
Sperling) and the log message argues:

[[[
However, similar problems still exist in (at least) the RA layer and the
working copy. Until those are fixed, rejecting content which causes a hash
collision is the safest approach and avoids the undesired consequences of
storing such content.
]]]

Since we need to be backwards compatible with older v1 clients, can this
check ever be removed (before Subversion 2)?

So, while I believe f32 is a good opportunity to switch to a new hash, what
is the problem we would like to solve with a new hash?

Kind regards,
Daniel Sahlberg

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Branko Čibej <br...@apache.org>.

On 27.12.2022 02:56, Karl Fogel wrote:
> Now, how hard would this be to actually implement?  The 
> pristineless-format WC upgrade is an opportunity to make other format 
> changes, but I'd hate to block the release of pristineless working 
> copies on this...

My point was that we shouldn't have to worry about format bumps as much 
any more because we have infrastructure in the client for supporting 
multiple WC formats. That includes optional pristines, different hashes, 
compressed pristines, etc. etc.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Dec 2022, Evgeny Kotkov via dev wrote:
>[Moving discussion to a new thread]
>
>We currently have a problem that a working copy relies on the checksum type
>with known collisions (SHA1).  A solution to that problem is to switch to a
>different checksum type without known collisions in one of the newer working
>copy formats.
>
>Since we plan on shipping a new working copy format in 1.15, this seems to
>be an appropriate moment of time to decide whether we'd also want to switch
>to a checksum type without known collisions in that new format.
>
>Below are the arguments for including a switch to a different checksum type
>in the working copy format for 1.15:
>
>1) Since the "is the file modified?" check now compares checksums, leaving
>   everything as-is may be considered a regression, because it would
>   introduce additional cases where a working copy currently relies on
>   comparing checksums with known collisions.
>
>2) We already need a working copy format bump for the pristines-on-demand
>   feature.  So using that format bump to solve the SHA1 issue might reduce
>   the overall number of required bumps for users (assuming that we'll still
>   need to switch from SHA1 at some point later).
>
>3) While the pristines-on-demand feature is not released, upgrading with a
>   switch to the new checksum type seems to be possible without requiring a
>   network fetch.  But if some of the pristines are optional, we lose the
>   possibility to rehash all contents in place.  So we might find ourselves
>   having to choose between two worse alternatives of either requiring a
>   network fetch during upgrade or entirely prohibiting an upgrade of
>   working copies with optional pristines.
>
>Thoughts?

A few thoughts:

First, Daniel Shahaf raises the question of whether there is 
really a problem here.  I.e., Why do we care about possible 
collisions when they're unlikely to happen in practice unless 
deliberately caused?

My answer is: we should care because it's very difficult to 
imagine all the consequences -- including but not limited to 
clever deliberate attacks -- that might follow from losing a 
property we formerly had.  The hash semantics we have always 
assumed are "If the file is modified, the hash will change."  When 
those semantics change, we don't need to be able to think 
immediately of a specific problematic scenario to know that this 
is a significant development.  We've lost the guarantee; that's 
enough to be worth worrying about.

BUT, if you want a scenario, here's one:

I have put WordPress installations under Subversion version 
control before.  Once, I detected an attack on one of those 
WordPress servers when one of the things the attacker did was 
modify some of the WordPress scripts on the server.  Those files 
showed up as modified when I ran 'svn st', and from there I ran 
'svn diff' and figured out what had happened.  But a super-careful 
attacker could make modifications that leave the 
version-controlled files with the same SHA1 hash they had before, 
thus making it harder to detect the attack.

Yes, I realize there are other ways to detect modifications, and 
that random attackers are unlikely to take the trouble to preserve 
hashes.  On the other hand, a well-resourced spear-phishing 
attacker who knows something about the usage of SVN at their 
target might indeed try a hash-preserving approach to breaking in. 
The point is, if we're counting on the hashes having certain 
semantics, then our users are counting on it too.  If SHA1 no 
longer has those semantics, we should upgrade.

Second, +1 to what Branko said: we should upgrade to a new hash 
when we upgrade a working copy anyway, but new clients should 
still be able to handle the old hash in old working copies without 
upgrading them.

Now, how hard would this be to actually implement?  The 
pristineless-format WC upgrade is an opportunity to make other 
format changes, but I'd hate to block the release of pristineless 
working copies on this...

Best regards,
-Karl

Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> > While here, I would like to raise a topic of incorporating a switch from
> > SHA1 to a different checksum type (without known collisions) for the new
> > working copy format.  This topic is relevant to the pristines-on-demand
> > branch, because the new "is the file modified?" check relies on the
> > checksum comparison, instead of comparing the contents of working and
> > pristine files.
> >
> > And so while I consider it to be out of the scope of the pristines-on-
> > demand branch, I think that we might want to evaluate if this is something
> > that should be a part of the next release.
>
> Good point.  Maybe worth a new thread?

[Moving discussion to a new thread]

We currently have a problem that a working copy relies on the checksum type
with known collisions (SHA1).  A solution to that problem is to switch to a
different checksum type without known collisions in one of the newer working
copy formats.

Since we plan on shipping a new working copy format in 1.15, this seems to
be an appropriate moment of time to decide whether we'd also want to switch
to a checksum type without known collisions in that new format.

Below are the arguments for including a switch to a different checksum type
in the working copy format for 1.15:

1) Since the "is the file modified?" check now compares checksums, leaving
   everything as-is may be considered a regression, because it would
   introduce additional cases where a working copy currently relies on
   comparing checksums with known collisions.

2) We already need a working copy format bump for the pristines-on-demand
   feature.  So using that format bump to solve the SHA1 issue might reduce
   the overall number of required bumps for users (assuming that we'll still
   need to switch from SHA1 at some point later).

3) While the pristines-on-demand feature is not released, upgrading with a
   switch to the new checksum type seems to be possible without requiring a
   network fetch.  But if some of the pristines are optional, we lose the
   possibility to rehash all contents in place.  So we might find ourselves
   having to choose between two worse alternatives of either requiring a
   network fetch during upgrade or entirely prohibiting an upgrade of
   working copies with optional pristines.
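The concern in point 1 can be sketched as follows. This is an illustrative model only, not the working copy code: the function names are made up, and the deliberately weak toy_digest stands in for any checksum with known collisions so that the blind spot can be shown without real colliding SHA1 inputs.

```python
import hashlib


def sha1_digest(data):
    return hashlib.sha1(data).hexdigest()


def toy_digest(data):
    # Deliberately weak: byte sum modulo 256, so reordered bytes collide.
    return format(sum(data) % 256, "02x")


def is_modified(working, recorded_digest, digest=sha1_digest):
    """Checksum-only "is the file modified?" check (no pristine content)."""
    return digest(working) != recorded_digest


pristine = b"ab"
# An ordinary edit changes the digest and is detected.
print(is_modified(b"ba", sha1_digest(pristine)))
# With a colliding digest, the same edit goes unnoticed.
print(is_modified(b"ba", toy_digest(pristine), digest=toy_digest))
```

A content comparison against a stored pristine would catch the second case; a checksum-only comparison cannot, which is why the strength of the checksum matters more once pristines are optional.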

Thoughts?

Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 07 Dec 2022, Evgeny Kotkov wrote:
>Evgeny Kotkov <ev...@visualsvn.com> writes:
>I think that the `pristines-on-demand-on-mwf` branch is now ready for a
>merge to trunk.  I could do that, assuming there are no objections.

+1, and thank you.  

Now, I haven't had time to do a real code review -- my manager hat 
gets tighter every year -- so my "+1" is mainly a sign of 
enthusiasm for the feature, and of general trust in our test suite 
and in everyone who has worked on this.

>  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf
>
>The branch includes the following:
>– Core implementation of the new mode where required pristines are fetched
>  at the beginning of the operation.
>– A new --store-pristine=yes/no option for `svn checkout` that is persisted
>  as a working copy setting.

+1 to this UI.  We can offer other gateways to this feature later, 
but this is a clean & simple way to start out.

>– An update for `svn info` to display the value of this new setting.

Yay.

>– A standalone test harness that tests main operations in both
>  --store-pristine modes and gets executed on every test run.
>– A new --store-pristine=yes/no option for the test suite that forces all
>  tests to run with a specific pristine mode.

Very nice. 

>The branch passes all tests in my Windows and Linux environments, in both
>--store-pristine=yes and =no modes.

W00t!

>While here, I would like to raise a topic of incorporating a switch from
>SHA1 to a different checksum type (without known collisions) for the new
>working copy format.  This topic is relevant to the pristines-on-demand
>branch, because the new "is the file modified?" check relies on the checksum
>comparison, instead of comparing the contents of working and pristine files.
>
>And so while I consider it to be out of the scope of the pristines-on-demand
>branch, I think that we might want to evaluate if this is something that
>should be a part of the next release.

Good point.  Maybe worth a new thread?

Best regards,
-Karl

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> > IMHO, once the tests are ready, we could merge it and release
> > it to the world.
>
> Apart from the required test changes, there are some technical
> TODOs that remain from the initial patch and should be resolved.
> I'll try to handle them as well.

I think that the `pristines-on-demand-on-mwf` branch is now ready for a
merge to trunk.  I could do that, assuming there are no objections.

  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf

The branch includes the following:
– Core implementation of the new mode where required pristines are fetched
  at the beginning of the operation.
– A new --store-pristine=yes/no option for `svn checkout` that is persisted
  as a working copy setting.
– An update for `svn info` to display the value of this new setting.
– A standalone test harness that tests main operations in both
  --store-pristine modes and gets executed on every test run.
– A new --store-pristine=yes/no option for the test suite that forces all
  tests to run with a specific pristine mode.

The branch passes all tests in my Windows and Linux environments, in both
--store-pristine=yes and =no modes.

While here, I would like to raise a topic of incorporating a switch from
SHA1 to a different checksum type (without known collisions) for the new
working copy format.  This topic is relevant to the pristines-on-demand
branch, because the new "is the file modified?" check relies on the checksum
comparison, instead of comparing the contents of working and pristine files.

And so while I consider it to be out of the scope of the pristines-on-demand
branch, I think that we might want to evaluate if this is something that
should be a part of the next release.

Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 29 Nov 2022, Johan Corveleyn wrote:
>My thanks also to the courageous people having developed this, and the
>gentle souls keeping the ball rolling :-).
>
>About the name:
>
>> [...]
>
>FWIW, my vote still goes to --store-pristines={yes|no}

Same here, FWIW.

I understand the argument that this exposes an "implementation 
detail" that the user is supposed to not need to think about.  But 
remember, the reason we developed this feature is because the user 
was *already* exposed to the existence of pristines: disk space 
usage by pristines is quite visible to the user -- that's the 
whole problem :-).

So only users who already "see" pristines -- that is, who are 
already aware of the storage issue -- would go looking for this 
feature in the first place.  So by the time they learn about the 
'--store-pristines' option, they're already being forced to deal 
with pristines as a concept, and the only question is whether the 
tool we give them to solve their problem will take advantage of 
that conceptual familiarity.

So, +1 to "--store-pristines=foo".

>I prefer such an explicit option here, rather than vague ones that
>could cover many different things. Also, --optimize=X can easily be
>interpreted inversely as intended (for instance: when I have an
>optimal network, do I use --optimize=network?)
>
>Apart from {yes|no} the feature might grow other option values in the
>future ('size-based' or 'text-only', or maybe simply 'auto' if we come
>up with a good general strategy that works for 99% of the cases, the
>details of which we don't want to burden our users with). We could
>even, in some distant future, allow user-defined names that are
>specified in ~/.subversion/config by the user (using some syntax where
>the user can set configurable size limits or mime-types or whatever).

I also agree with Johan's point here.

>One other suggestion: not a blocker of course, but a
>runtime-config-area default would be nice :-). Users might want to
>choose the same option all the time, without having to remember to add
>the option to their checkout command.
>
>Something like, in ~/.subversion/config
>
>store-pristines-default={yes|no}

Later on, this might grow into more sophisticated local run-time 
config regarding pristines, but for now, providing this basic 
yes/no default is a good idea.  For example, on machines where one 
is regularly checking out trees with huge files, one might set the 
default to "no".
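As a rough sketch of how such a default could interact with the command line, assuming the proposed store-pristines-default key lives in an INI-style section of the runtime config. The key is only a proposal in this thread, and the section name and the function below are hypothetical.

```python
import configparser


def store_pristines(config_text, cli_value=None):
    """Decide whether a new working copy should store pristines.

    An explicit command-line value ("yes" or "no") wins; otherwise the
    hypothetical store-pristines-default key is consulted; otherwise the
    historical behaviour (store pristines) applies.
    """
    if cli_value is not None:
        return cli_value == "yes"
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # "miscellany" is an assumed placement for the proposed key.
    value = parser.get("miscellany", "store-pristines-default", fallback="yes")
    return value.strip().lower() == "yes"


config = "[miscellany]\nstore-pristines-default = no\n"
print(store_pristines(config))                   # config default applies
print(store_pristines(config, cli_value="yes"))  # explicit option overrides
```

Keeping the command-line option authoritative means a machine-wide "no" default never prevents a user from asking for a fully hydrated checkout.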

Best regards,
-Karl

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Johan Corveleyn <jc...@gmail.com>.

My thanks also to the courageous people having developed this, and the
gentle souls keeping the ball rolling :-).

About the name:

On Thu, Nov 24, 2022 at 3:57 PM Nathan Hartman <ha...@gmail.com> wrote:
...
> Previously we got stuck trying to choose the user-facing name of this
> feature and its command line switches.
>
> Currently the CLI switch is --store-pristine={yes|no}.
>
> I'm okay with this, but for completeness I'll mention that earlier in
> the year there was a little bit of push back because pristines, up
> until now, have been an internal implementation detail that users
> needn't concern themselves with. (Except that they double the storage
> space...)
>
> I've been trying to think of something better for months now, and
> here's what I've come up with:
>
> --optimize=storage
> --optimize=network

FWIW, my vote still goes to --store-pristines={yes|no}

I prefer such an explicit option here, rather than vague ones that
could cover many different things. Also, --optimize=X can easily be
interpreted inversely as intended (for instance: when I have an
optimal network, do I use --optimize=network?)

Apart from {yes|no} the feature might grow other option values in the
future ('size-based' or 'text-only', or maybe simply 'auto' if we come
up with a good general strategy that works for 99% of the cases, the
details of which we don't want to burden our users with). We could
even, in some distant future, allow user-defined names that are
specified in ~/.subversion/config by the user (using some syntax where
the user can set configurable size limits or mime-types or whatever).

One other suggestion: not a blocker of course, but a
runtime-config-area default would be nice :-). Users might want to
choose the same option all the time, without having to remember to add
the option to their checkout command.

Something like, in ~/.subversion/config

store-pristines-default={yes|no}

Just my 2 cents of course ...
-- 
Johan

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Nathan Hartman <ha...@gmail.com>.

On Wed, Nov 23, 2022 at 9:53 AM Julian Foad <ju...@apache.org> wrote:
> Nathan, I see you replied enthusiastically and mentioned "I have much to
> say on both of these [TODOs] but I won't go into detail yet...". It
> seems to me it could be helpful to get that started sooner rather than
> later, too, if those issues still need hashing out.

Thanks for the nudge.

Previously we got stuck trying to choose the user-facing name of this
feature and its command line switches.

Currently the CLI switch is --store-pristine={yes|no}.

I'm okay with this, but for completeness I'll mention that earlier in
the year there was a little bit of push back because pristines, up
until now, have been an internal implementation detail that users
needn't concern themselves with. (Except that they double the storage
space...)

I've been trying to think of something better for months now, and
here's what I've come up with:

--optimize=storage
--optimize=network

Rationale:

* Self-documenting.

* Easy to explain: --optimize=storage saves storage space;
  --optimize=network reduces network accesses to the repository
  server.

* Users don't need to know about pristines. There aren't several levels
  of abstraction between the option name and why the user cares about
  it.

* Extensible. Maybe we can think of other ways to optimize for network
  bandwidth, for example.

The docs can give more user-facing explanation, including tradeoffs,
which SVN operations are affected, and example scenarios to help users
choose. It should be much easier to write -- and read -- than what we
currently have at the draft release notes [1].

As for example scenarios, while the original premise was to save space
on large files that don't change often, i525pod is also great in other
situations, such as checking out a large source tree on a ramdrive
(limited space), or on the same machine as the repo, or on a storage-
limited embedded device. (I've tried i525pod in all 3 of these
scenarios!)

Downsides:

* Admittedly, --optimize=network isn't the best name in all scenarios.
  Notably, this is a misnomer when the repository server is on the same
  machine as the working copy, but that might not matter because it's
  the default. (And I might suggest trying --optimize=storage in that
  scenario).

* If we ever want to do other cool things with pristines, such as an
  option to keep more locally cached history, these names won't be
  right for that.

* These option names haven't helped me come up with a better name for
  the feature itself.

There is an advantage to using --store-pristine={yes|no}: We don't need
to rename the feature because Pristines On Demand and the CLI options
are named similarly.

The disadvantage of --store-pristine={yes|no} is that the feature is
more burdensome for us to explain and for others to learn about,
especially from a non-technical standpoint. How would you explain this
feature in a press release, or in a short blurb (or dare I say, tweet)
about "What's new in Subversion 1.15?"

Some other possibilities that were discussed:

I'll mention these for completeness but note that if --optimize=x is
shot down, I'd rather use --store-pristine={yes|no} than any of these:

* Hydrate and dehydrate -- perhaps the terms that appear most in dev
  discussions. I don't recommend these in user-facing areas because
  they aren't self-documenting. Users can't deduce what these actually
  do for the user. Users might mistakenly think that their working
  files would be hydrated or dehydrated in some way. Users would have
  to learn about pristines to know what is being hydrated or
  dehydrated, eliminating any useful abstraction.

* "Bare working copies" -- the draft release notes [1] use this term
  tentatively to explain that "bare" working copies save storage by not
  caching "BASE" files. Unfortunately, "bare" and "BASE" differ by only
  one letter (and capitalization) and I feel like the explanation is
  too complicated and doesn't bring us closer to a good result.

* Briefly discussed: "local BASE" or "remote BASE" -- but that's a
  misnomer because there's no such thing as "remote" BASE.

Well, you've been warned that I have much to say. :-)

Cheers,
Nathan

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Julian Foad <ju...@apache.org>.

I'm glad to see you all picking up this project again. While working on
this at the beginning of the year I turned on the pristines-on-demand
mode in some of my own WCs such as my 'Documents' tree which includes
lots of scanned paper docs. It works nicely for cases like this, and
feels right, the pristine store being mostly unpopulated when the
working files are mostly unchanging.

I meant to check back with you during the year, how we should take it
forward. The recent summary in this thread sounds about right. My own
capacity to contribute is steadily decreasing. So, thank you, dev
community: it's good to see people working together to make it happen.
It would be pleasing to see this being brought to a satisfactory state
and released.

Nathan, I see you replied enthusiastically and mentioned "I have much to
say on both of these [TODOs] but I won't go into detail yet...". It
seems to me it could be helpful to get that started sooner rather than
later, too, if those issues still need hashing out.

- Julian

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 16 Nov 2022, Evgeny Kotkov wrote:
>Apart from the required test changes, there are some technical
>TODOs that remain from the initial patch and should be resolved.
>I'll try to handle them as well.

Thank you!

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> Thank you, Evgeny!  Just to make sure I understand correctly --
> the status now on the 'pristines-on-demand-on-mwf' branch is:
>
> 1) One can do 'svn checkout --store-pristines=no' to get an
> entirely pristine-less working copy.  In that working copy,
> individual files will be hydrated/dehydrated automagically on an
> as-needed basis.
>
> 2) There is no command to hydrate or dehydrate a particular file.
> Hydration and dehydration only happen as a side effect of other
> regular Subversion operations.
>
> 3) There is no way to rehydrate the entire working copy.  E.g.,
> something like 'svn update --store-pristines=yes' or 'svn hydrate
> --depth=infinity' does not exist yet.
>
> 4) Likewise, there is no way to dehydrate an existing working copy
> that currently has its pristines (even if that working copy is at
> a high-enough version format to support pristinelessness).  E.g.,
> something like 'svn update --store-pristines=no' or 'svn dehydrate
> --depth=infinity' does not exist yet.
>
> Is that all correct?

Yes, I believe that is correct.

> By the way, I do not think (2), (3), and (4) are blockers.  Just
> (1) by itself is a huge step forward and solves issue #525;

+1 on keeping the scope of the feature to just (1) for now.

> IMHO, once the tests are ready, we could merge it and release
> it to the world.

Apart from the required test changes, there are some technical
TODOs that remain from the initial patch and should be resolved.
I'll try to handle them as well.


Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 15 Nov 2022, Evgeny Kotkov wrote:
>Evgeny Kotkov <ev...@visualsvn.com> writes:
>
>> Perhaps we could transition into that state by committing the patch
>> and maybe re-evaluate things from there.  I could do that, assuming
>> no objections, of course.
>
>Committed the patch in https://svn.apache.org/r1905324
>
>I'll try to handle the related tasks in the near future.

Thank you, Evgeny!  Just to make sure I understand correctly -- 
the status now on the 'pristines-on-demand-on-mwf' branch is:

1) One can do 'svn checkout --store-pristines=no' to get an 
entirely pristine-less working copy.  In that working copy, 
individual files will be hydrated/dehydrated automagically on an 
as-needed basis.

2) There is no command to hydrate or dehydrate a particular file. 
Hydration and dehydration only happen as a side effect of other 
regular Subversion operations.

3) There is no way to rehydrate the entire working copy.  E.g., 
something like 'svn update --store-pristines=yes' or 'svn hydrate 
--depth=infinity' does not exist yet.

4) Likewise, there is no way to dehydrate an existing working copy 
that currently has its pristines (even if that working copy is at 
a high-enough version format to support pristinelessness).  E.g., 
something like 'svn update --store-pristines=no' or 'svn dehydrate 
--depth=infinity' does not exist yet.

Is that all correct?

By the way, I do not think (2), (3), and (4) are blockers.  Just 
(1) by itself is a huge step forward and solves issue #525; IMHO, 
once the tests are ready, we could merge it and release it to the 
world.

Best regards,
-Karl

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> Perhaps we could transition into that state by committing the patch
> and maybe re-evaluate things from there.  I could do that, assuming
> no objections, of course.

Committed the patch in https://svn.apache.org/r1905324

I'll try to handle the related tasks in the near future.


Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> By the way, in that thread, Evgeny Kotkov -- whose initial work
> much of this is based on -- follows up with a patch that does a
> first-pass implementation of 'svn checkout --store-pristines=no'
> (by implementing a new persistent setting in wc.db).

Perhaps we could transition into that state by committing the patch
and maybe re-evaluate things from there.  I could do that, assuming
no objections, of course.

Thanks,
Evgeny Kotkov

Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

Hi, all.  This is a high-level mail in which I try to figure out 
the current status of the issue #525 work and what's left to land 
it in trunk and release it.  Corrections and feedback welcome.

To remind everyone:

The purpose of this work is to reduce checkout sizes by optionally 
not having local pristine text-bases for WC files.  In trees that 
have lots of large binary files, this can reduce disk usage by 
about half, so it really matters for some use cases.  Also: in the 
long run, we want the user to be able to specify which files do 
and don't have pristines (but in the first release it can be a 
per-WC choice).

Current status as I understand it:

First, Julian has written up a great description of how the 
feature works from a user's perspective:

https://svn.apache.org/viewvc/subversion/branches/pristines-on-demand-on-mwf/notes/i525/i525-user-guide.md?view=log

Based on that document, it looks to me like we still need some 
well-named knobs by which the user can control this feature. 
Right now, the command-line way looks like one of these:

  $ svn checkout --compatible-version=1.15
  $ svn upgrade --compatible-version=1.15

However, there's a "TODO" note that addresses this UI point:

  > [TODO] We might change this so that upgrading to 1.15-compatible
  > format and enabling "i525pod" are separate steps and the latter is
  > optional.

I think we should implement that TODO before releasing the 
feature.  Ideally, the new WC format would support the 
"pristines-on-demand" feature without forcing a given WC to be in 
p-o-d mode.

Right now, if I understand correctly, a WC can either be entirely 
in p-o-d mode or entirely in regular mode (i.e., the current 
default, with pristines always present for everything).  In 
other words, in its first release, this feature would *not* allow 
users to specify that certain files in a WC should be p-o-d while 
other files are regular (but see the note "Now, a subtle point..." 
below about this).  It's a whole-WC thing.

However, I think it's okay to release this feature that way, 
without support for selective per-file p-o-d, as long as the UI 
for per-WC toggling is clear (e.g., not a flag like 
"--compatible-version=1.15", which doesn't say anything about the 
actual behavior being toggled).

("Toggle" may be the wrong word here, as I believe we also don't 
yet have a way to bring a WC back from p-o-d to regular mode.  Do 
we care about that for release?)

Now, a subtle point about this UI issue:

In the bright future, when we *do* support per-file specification 
of p-o-d-ness, there would be no need for a per-WC flag at all. 
Instead, users would specify that certain files should be p-o-d 
either by using client-side configuration options (e.g., all files 
larger than a given size, or having certain MIME type(s), are in 
p-o-d mode), or via command line actions to support explicit 
"hydrate" and "dehydrate" operations (these actions would either 
be top-level subcommands or options to existing commands -- we 
don't need to decide that detail now).

I guess what I'm saying is, if we are *close* to having the 
underlying WC support needed to support per-file selection of 
p-o-d-ness, then maybe it's better to go all the way and just 
finish that.  *Then* people could simply upgrade their working 
copies as usual, with no immediate behavior change resulting from 
that upgrade, and this new feature would then be available to 
them.  We would then offer...

  $ svn checkout --store-pristines=no
  $ svn upgrade --store-pristines=no

...as the gateways to the feature in the first release (so 
p-o-d-ness would apply to every file in the WC), and add selective UI in 
later releases, knowing that the underlying UI already supports 
it.  However, if that's a complex change in the WC code, then 
let's just release with whole-WC support and not delay.

Have I summarized the current status accurately?  Thoughts?

Please see also Julian's status email from April, which goes into 
more detail about which tests need updating, etc:

  https://lists.apache.org/thread/lm98og8jqonffcs250q5y3ft5r5qlmk5

  From: Julian Foad
  To: Daniel Shahaf
  Cc: Subversion Dev, Karl Fogel
  Subject: Re: A two-part vision for Subversion and large binary 
  objects.
  Date: Tue, 5 Apr 2022 15:50:56 +0100
  Message-ID: 
  <70...@getmailspring.com>

By the way, in that thread, Evgeny Kotkov -- whose initial work 
much of this is based on -- follows up with a patch that does a 
first-pass implementation of 'svn checkout --store-pristines=no' 
(by implementing a new persistent setting in wc.db).

Note that Julian and Daniel originally undertook this work as part 
of a contract with my company (which represents a consortium of 
companies interested in this feature).  Mostly it was Julian 
writing new code and Daniel reviewing and writing tests, and I 
thank both of them for having gotten us this far.

The work went a bit over budget not through any fault of theirs, 
but because we ran into an unexpected snag having to do with order 
of network operations in Subversion.  TL;DR: even though in 
*theory* an operation can always know at the beginning which 
pristines it has locally and which ones it doesn't, Subversion's 
current client/server communications conventions don't take 
advantage of that information in the way we'd want.  Instead, the 
client assumes pristines are present and sends up-front revision 
information to the server, causing the server to send responses 
that rely on those pristines being present.  The whole way the 
client and server talk to each other is based on this; it's 
fixable, of course, but doing so is not simple and probably not 
just client-side.  So the 'pristines-on-demand-on-mwf' branch 
takes a reasonable-but-not-perfect solution for now; the 
'pristines-on-demand-issue4892' branch, which forks from it, improves the 
situation [1], but is not complete and needn't block release. 
(See [2] for deeper discussion.)

I'll talk privately with them about finishing this and the budget 
required to do so.  I think we're close and would really like to 
see this feature released soon.  (Note that we have merged the 
'multi-wc-format' branch to trunk, in r1898187 on 2022-02-18. 
IIUC that was a necessary predecessor to everything else.)

We should be able to get there from here, right?

Best regards,
-Karl

[1] This command will give you some sense of the difference 
between those two branches:

  $ svn diff \
      https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf/notes/i525/i525-user-guide.md \
      https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-issue4892/notes/i525/i525-user-guide.md

[2] 
https://lists.apache.org/thread/mwo5zy14wlkbs8j4334zn0296dl472qd

    From: Evgeny Kotkov
    To: Julian Foad
    Cc: Subversion Dev
    Subject: Re: Issue #525/#4892: on only fetching the pristines 
    we really need
    Date: Fri, 11 Mar 2022 18:23:55 +0300
    Message-ID: 
    <CA...@mail.gmail.com>

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

A status review from my P.O.V.


Pristines (#525):

* issues filed (potential blockers):

  - #4888 authz denied during textbase sync
    (an edge case issue, not sure if it's a blocker)
  - #4889 per-WC config
    (wanted)
  - #4891 fix disabled tests
    (a few different edge cases; much of the analysis is posted in the issue)

* next milestone: merge to trunk

  - Are any of the issues blockers for merge to trunk? I would suggest not.


Getting multi-wc-format ready for release (#4883):

* issues filed (potential blockers):

  - #4885 WC upgraded and not-upgraded notifications
    (still open for some nice-to-haves, but probably done enough for MVP)
  - #4886 config for default WC version for checkout & upgrade
    ()
  - #4887 clarify/unify option names for compatible-version
    (perhaps change '--compatible-version' to '--wc-compatible-version'
    or '--min-compatible-client')

* not filed:

  - API review; thread: "multi-wc-format review"
    (current state: the APIs are mostly private and a bit messy; not clear
    what, if anything, we would want to change)


Release notes are drafted:

<https://subversion-staging.apache.org/docs/release-notes/1.15#wc-upgrade>
<https://subversion-staging.apache.org/docs/release-notes/1.15#bare-working-copies>

Testing:

Some testing by devs has been done of both multi-wc-format and
pristines-on-demand. It seems to be generally in good shape; no glaring
issues found.

Next steps I suggest:

  - propose merge to trunk
  - review the issues mentioned: fix or decide to postpone each

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Tue, Mar 08, 2022 at 17:59:20 -0600:
> There are reasonable arguments both ways for
> shipping MVP with/without x-hydrate functionality.
> 
> What do others think?

Just bumping Karl's question.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 08 Mar 2022, Daniel Shahaf wrote:
>Sure.  I was asking whether by "once the user has a local 
>pristine" you
>meant a pristine — as in, a file under .svn/pristine/ that 
>.svn/wc.db
>knows about and uses — or Alice making a local copy of the 
>contents of
>file@BASE somewhere libsvn doesn't know about.

Well, depending on the context, I may be using the word "pristine" 
flexibly.  Sometimes I mean a literal integrated-into-wc-metadata 
pristine, and sometimes I just mean "an extra copy of the file, 
that the user has made locally".

(It's possible that the degree of precision you would like in this 
sub-discussion is not one I'm willing to adhere to consistently 
:-).  I can't always predict what will matter to a given 
interlocutor.  But I'll try to be sufficiently precise in my 
responses below at least.) 

>A manual copy of the BASE revision would "serve for local diffs 
>and
>reverts", indeed, but I would hesitate to recommend this, 
>because diff
>and revert are both core operations.  If users need to reinvent 
>these
>two wheels, then:
>
>- All the advantages of having just that one well-known «svn 
>revert»
>  button that all the users' GUI clients and scripts can press 
>  are lost
>
>- The local disk storage cost will be paid, but without all the
>  benefits: e.g., commit will use a self-delta rather than a 
>  delta
>  against BASE even if the file format does lend itself to binary 
>  diffs;
>  ra_serf's ability to not download a file if the wc has another 
>  file
>  with the same sha1 won't be used; the keyword-contraction and
>  diff-ignore-content-type features of «svn diff» will need to be
>  reimplemented; etc.
>
>- We might leave a bad impression on potential users
>
>As an MVP alternative, some sort of command to hydrate a single 
>file,
>perhaps, as you have proposed?  CLI-wise, I'll just say we might 
>want to
>mark such a command as experimental (name it "x-foo" and document 
>it has
>reduced forward compatibility promises).  Backend-wise, we'll 
>want to
>ensure a manually-hydrated file doesn't get dehydrated too soon.
>
>What's "too soon"?  Until the user explicitly requests or permits
>dehydration.  If hydration was manual, so should dehydration be.
>
>Makes sense?

Yes, thanks for the suggestion, and I agree.  I would love for MVP 
or MVP+1 to have an explicit "rehydrate" UI.  I think there 
*might* be some value to shipping MVP without such a feature, in 
order to first get some real-world experience with how people use 
pristine-less working copies, before we make long-lasting UI 
decisions.

But anyway, +1 to the general idea.

>The context of all this is whether 'update' should fetch 
>pristines for
>modified files.  I guess it should not do so by default (there's 
>no
>reason to incur the costs, and the user has opted in to 
>pristines-on-demand),
>but I don't think we should tell users to keep pristines _and not 
>tell
>libsvn_wc about them_.  The cost of implementing «svn x-hydrate»
>(however named) is smaller than the cost of asking users to 
>reimplement
>core version control functionality.

Users can already copy files behind Subversion's back, of course.

I'm worried that implementing 'svn x-hydrate' commands now would 
be premature -- we don't know enough about real-world usage yet. 
I'd feel more comfortable putting out one release (of 
x-hydrate-less MVP) to get feedback on pristine-less working 
copies.  We could even say that we're considering adding x-hydrate 
commands but that we're waiting until the next release so we can 
make sure our UI ideas match people's actual needs.

Anyone else have thoughts on this?

>If we think there are use-cases in which users will want to have
>a pristine for a modified file, whether those use-cases involve 
>«commit»
>or «diff» or «revert» or whatever else, then that pristine 
>shouldn't be
>just the user's private copy of BASE; it should be a real 
>pristine, live
>in .svn/pristine/ and be known to wc.db, and used for all svn 
>operations,
>not just those the user has reimplemented.

I understand the motivation.  There are reasonable arguments both 
ways for shipping MVP with/without x-hydrate functionality.

What do others think?

>This way, by default «commit» will send self-deltas, but if the 
>user
>wants a pristine for diffs or reverts, then reverts, diffs, and 
>commits
>will all use the pristine.  There shouldn't be any need for the 
>user to
>reimplement their own pristine store and their own diff and 
>revert
>operations.
>
>And yes, commit might not want to use pristines this way, but 
>that's
>actually a separate feature request: a request to change the 
>"When
>committing a change to a pristineful file, send a delta against 
>BASE or
>a self-delta, whichever is smaller" logic, which IIRC works by 
>computing
>a delta against BASE and comparing its length to the 
>repository-normal
>filesize, to something that doesn't compute a delta against BASE 
>in the
>first place.

Yes, that's a good point (in that last paragraph there), and we 
should take it into account when (re)implementing commit logic.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Tue, Mar 08, 2022 at 14:01:22 -0600:
> On 08 Mar 2022, Daniel Shahaf wrote:
> > Karl Fogel:
> > > Hmm, I don't see where I was assuming that the pristine would be
> > > needed exactly once, though.  Once the user has a local pristine
> > > (by whatever means),
> > 
> > To be clear, we're only talking about pristines that libsvn_wc knows
> > about, right?  As opposed to Alice running «svn cat iota@BASE» and
> > saving the output somewhere.
> 
> Hmm, I don't think I understand the question here.  Can you ask it with more
> details / context?

Sure.  I was asking whether by "once the user has a local pristine" you
meant a pristine — as in, a file under .svn/pristine/ that .svn/wc.db
knows about and uses — or Alice making a local copy of the contents of
file@BASE somewhere libsvn doesn't know about.

> > > if she wants to keep that local pristine after committing its
> > > corresponding working file, then she could do so or not do so,
> > > depending on
> > > whether she wants to continue paying the local storage cost for it.
> > 
> > How would Alice keep iota's pristine after committing iota?  «svn commit
> > iota» deletes iota's pristine.
> 
> Like I said, I wasn't going into UI details.

Sure.  Neither was I.

> But if Subversion wants to offer a way for commit to keep the
> post-commit pristine around (in circumstances where that file would
> otherwise be pristine-less), it can do so.  This wouldn't be for MVP,
> of course; I'm just saying it's a conceivable feature and maybe some
> day we'll offer it.

+1

> For now, the way Alice would keep an "informal pristine" would be to simply
> copy the file manually.  That's not a pristine in the full sense of the
> word, but it will serve for local diffs and reverts of course.

A manual copy of the BASE revision would "serve for local diffs and
reverts", indeed, but I would hesitate to recommend this, because diff
and revert are both core operations.  If users need to reinvent these
two wheels, then:

- All the advantages of having just that one well-known «svn revert»
  button that all the users' GUI clients and scripts can press are lost

- The local disk storage cost will be paid, but without all the
  benefits: e.g., commit will use a self-delta rather than a delta
  against BASE even if the file format does lend itself to binary diffs;
  ra_serf's ability to not download a file if the wc has another file
  with the same sha1 won't be used; the keyword-contraction and
  diff-ignore-content-type features of «svn diff» will need to be
  reimplemented; etc.

- We might leave a bad impression on potential users

As an MVP alternative, some sort of command to hydrate a single file,
perhaps, as you have proposed?  CLI-wise, I'll just say we might want to
mark such a command as experimental (name it "x-foo" and document it has
reduced forward compatibility promises).  Backend-wise, we'll want to
ensure a manually-hydrated file doesn't get dehydrated too soon.

What's "too soon"?  Until the user explicitly requests or permits
dehydration.  If hydration was manual, so should dehydration be.

Makes sense?
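
The "manual hydration implies manual dehydration" rule can be sketched as a small state model. This is a minimal Python illustration only: `PristineStore`, its method names, and the `pinned` flag are hypothetical and do not correspond to anything in libsvn_wc.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PristineState:
    present: bool = False  # is the pristine text on disk?
    pinned: bool = False   # True only when the user hydrated it explicitly


class PristineStore:
    """Toy model of per-file pristine bookkeeping (hypothetical)."""

    def __init__(self) -> None:
        self.files: Dict[str, PristineState] = {}

    def _state(self, path: str) -> PristineState:
        return self.files.setdefault(path, PristineState())

    def fetch_on_demand(self, path: str) -> None:
        # An operation needed the pristine temporarily; leave 'pinned' as is.
        self._state(path).present = True

    def hydrate(self, path: str) -> None:
        # Explicit user request (an "x-hydrate"-style command, however named).
        state = self._state(path)
        state.present = True
        state.pinned = True

    def dehydrate(self, path: str) -> None:
        # Explicit user request: the only way to drop a pinned pristine.
        state = self._state(path)
        state.present = False
        state.pinned = False

    def cleanup_after_operation(self) -> None:
        # Automatic cleanup removes only pristines that were fetched on demand.
        for state in self.files.values():
            if state.present and not state.pinned:
                state.present = False
```

In this model, automatic cleanup never touches a pristine the user asked for, which is one way to read "not dehydrated too soon".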

----

The context of all this is whether 'update' should fetch pristines for
modified files.  I guess it should not do so by default (there's no
reason to incur the costs, and the user has opted in to pristines-on-demand),
but I don't think we should tell users to keep pristines _and not tell
libsvn_wc about them_.  The cost of implementing «svn x-hydrate»
(however named) is smaller than the cost of asking users to reimplement
core version control functionality.

If we think there are use-cases in which users will want to have
a pristine for a modified file, whether those use-cases involve «commit»
or «diff» or «revert» or whatever else, then that pristine shouldn't be
just the user's private copy of BASE; it should be a real pristine, live
in .svn/pristine/ and be known to wc.db, and used for all svn operations,
not just those the user has reimplemented.

This way, by default «commit» will send self-deltas, but if the user
wants a pristine for diffs or reverts, then reverts, diffs, and commits
will all use the pristine.  There shouldn't be any need for the user to
reimplement their own pristine store and their own diff and revert
operations.

And yes, commit might not want to use pristines this way, but that's
actually a separate feature request: a request to change the "When
committing a change to a pristineful file, send a delta against BASE or
a self-delta, whichever is smaller" logic, which IIRC works by computing
a delta against BASE and comparing its length to the repository-normal
filesize, to something that doesn't compute a delta against BASE in the
first place.
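
The decision being discussed can be sketched roughly as follows. This is a toy Python model, not Subversion's commit code: `zlib` with a preset dictionary stands in for a real delta against BASE, plain `zlib` compression stands in for a self-delta, and the function name and return labels are made up for illustration.

```python
import zlib
from typing import Optional, Tuple


def choose_commit_payload(working: bytes,
                          pristine: Optional[bytes]) -> Tuple[str, int]:
    """Return (kind, size_in_bytes) of what a commit would send in this model."""
    # "Self-delta": the file encoded without any reference to BASE.
    self_delta = zlib.compress(working, 9)
    if pristine is None:
        # Pristineless file: there is no BASE text locally, so a self-delta
        # is the only choice that avoids downloading anything first.
        return ("self-delta", len(self_delta))
    # Stand-in for a delta against BASE: deflate using BASE as the dictionary.
    compressor = zlib.compressobj(level=9, zdict=pristine)
    base_delta = compressor.compress(working) + compressor.flush()
    # Pristineful file: send whichever encoding is smaller.
    if len(base_delta) < len(self_delta):
        return ("delta-against-base", len(base_delta))
    return ("self-delta", len(self_delta))
```

Note that the pristineful branch still computes the delta against BASE before comparing, which is exactly the cost the separate feature request would aim to avoid.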

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 08 Mar 2022, Daniel Shahaf wrote:
>> Hmm, I don't see where I was assuming that the pristine would 
>> be needed
>> exactly once, though.  Once the user has a local pristine (by 
>> whatever
>> means),
>
>To be clear, we're only talking about pristines that libsvn_wc 
>knows
>about, right?  As opposed to Alice running «svn cat iota@BASE» 
>and
>saving the output somewhere.

Hmm, I don't think I understand the question here.  Can you ask it 
with more details / context?

>> if she wants to keep that local pristine after committing its
>> corresponding working file, then she could do so or not do so, 
>> depending on
>> whether she wants to continue paying the local storage cost for 
>> it.
>
>How would Alice keep iota's pristine after committing iota?  «svn 
>commit
>iota» deletes iota's pristine.

Like I said, I wasn't going into UI details.  But if Subversion 
wants to offer a way for commit to keep the post-commit pristine 
around (in circumstances where that file would otherwise be 
pristine-less), it can do so.  This wouldn't be for MVP, of 
course; I'm just saying it's a conceivable feature and maybe some 
day we'll offer it.

For now, the way Alice would keep an "informal pristine" would be 
to simply copy the file manually.  That's not a pristine in the full 
sense of the word, but it will serve for local diffs and reverts 
of course.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Tue, Mar 08, 2022 at 12:32:38 -0600:
> On 08 Mar 2022, Daniel Shahaf wrote:
> > Karl Fogel wrote on Mon, Mar 07, 2022 at 13:44:03 -0600:
> > > And in the absence of fancy cross-network common-prefix detection
> > > code that we're not going to write, this would just be
> > > cost-shifting anyway.  Whatever commit-time improvement one would
> > > gain from having the pristine locally would be offset by the extra
> > > time spent fetching the pristine to make that commit-time
> > > improvement possible.
> > 
> > What assumptions is this conclusion valid under?  It seems this
> > conclusion assumes, at least, that the uplink and downlink bandwidths
> > are equal and that the pristine will be needed exactly once (i.e.,
> > a hydrate-commit-dehydrate sequence).
> 
> I was assuming up and down speeds are roughly the same, yes.
> 
> Hmm, I don't see where I was assuming that the pristine would be needed
> exactly once, though.  Once the user has a local pristine (by whatever
> means),

To be clear, we're only talking about pristines that libsvn_wc knows
about, right?  As opposed to Alice running «svn cat iota@BASE» and
saving the output somewhere.

> if she wants to keep that local pristine after committing its
> corresponding working file, then she could do so or not do so, depending on
> whether she wants to continue paying the local storage cost for it.

How would Alice keep iota's pristine after committing iota?  «svn commit
iota» deletes iota's pristine.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 08 Mar 2022, Daniel Shahaf wrote:
>Karl Fogel wrote on Mon, Mar 07, 2022 at 13:44:03 -0600:
>> And in the absence of fancy cross-network common-prefix 
>> detection code that
>> we're not going to write, this would just be cost-shifting 
>> anyway.  Whatever
>> commit-time improvement one would gain from having the pristine 
>> locally
>> would be offset by the extra time spent fetching the pristine 
>> to make that
>> commit-time improvement possible.
>
>What assumptions is this conclusion valid under?  It seems 
>this
>conclusion assumes, at least, that the uplink and downlink 
>bandwidths
>are equal and that the pristine will be needed exactly once 
>(i.e.,
>a hydrate-commit-dehydrate sequence).

I was assuming up and down speeds are roughly the same, yes.

Hmm, I don't see where I was assuming that the pristine would be 
needed exactly once, though.  Once the user has a local pristine 
(by whatever means), if she wants to keep that local pristine 
after committing its corresponding working file, then she could do 
so or not do so, depending on whether she wants to continue paying 
the local storage cost for it.
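
To make the cost-shifting argument and its assumptions concrete, here is a small Python sketch of the transfer-time arithmetic. It is a simplification under stated assumptions: a self-delta is taken to cost about as much as sending the whole file, latency and compression are ignored, and the function and parameter names are invented for this example.

```python
from typing import Tuple


def transfer_seconds(file_size: float, delta_size: float,
                     up_bytes_per_s: float, down_bytes_per_s: float,
                     commits: int = 1) -> Tuple[float, float]:
    """Return (with_pristine, without_pristine) total transfer time in seconds.

    with_pristine:    download the pristine once, then upload a small delta
                      for each commit.
    without_pristine: upload roughly the full file for each commit.
    """
    with_pristine = (file_size / down_bytes_per_s
                     + commits * delta_size / up_bytes_per_s)
    without_pristine = commits * file_size / up_bytes_per_s
    return with_pristine, without_pristine
```

With equal up and down speeds and a single commit, fetching the pristine first is never faster, which is the cost-shifting case. With a much faster downlink, or with several commits reusing the same pristine, the comparison can go the other way.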

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 08 Mar 2022, Daniel Shahaf wrote:
>I wasn't proposing we require such a step.  I was merely saying 
>that was
>one of several possible solutions to the "How to commit a 
>pristineless
>file" question.  Here they are again:
>
>1. Download the pristine and then send a regular delta
>2. Send a self-delta
>3. rsync the file
>4. Avoid getting into this situation in the first place
>
>I guess we'll be happy with (2) for the MVP.

Very happy with (2) for MVP, and possibly for all time :-).

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Daniel Sahlberg wrote on Tue, Mar 08, 2022 at 14:34:06 +0100:
> On Tue, 8 Mar 2022 at 14:17, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> 
> > An alternative is to require the user to let svn know before they're
> > starting to edit a file, so we can create a pristine off the on-disk
> > file.  This way we won't have pristineless modified files in the first
> > place.
> >
> 
> Not "require". It might be interesting for some use cases to have "svn
> create-pristine-from-wc" as a manual step, but not adding this as part of
> the normal workflow. I have some wc's that might benefit from being
> pristine-less, but I'm not prepared to pay the extra cost (time-wise) of an
> svn:needs-locking-like step for every file I need to modify. I don't think
> this new command (or option) is MVP.

I wasn't proposing we require such a step.  I was merely saying that was
one of several possible solutions to the "How to commit a pristineless
file" question.  Here they are again:

1. Download the pristine and then send a regular delta
2. Send a self-delta
3. rsync the file
4. Avoid getting into this situation in the first place

I guess we'll be happy with (2) for the MVP.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Sahlberg <da...@gmail.com>.

On Fri, 11 Mar 2022 at 13:05, Julian Foad <ju...@apache.org> wrote:

> Here is an approach that does *not* satisfy both sides of this argument:
>
> [[[
> svn propset "svn:no-pristines" "*" doc/
>
> cat >> ~/.subversion/config <<EOF
> [auto-props]
> src/**/*.exe = "svn:no-pristines = *"
> EOF
> ]]]
>
> and we make standard Subversion control its pristine storage based on
> looking for the versioned property "svn:no-pristines".
>
> Here is an approach that *does* satisfy both sides of this argument. It
> is indirect.
>
> First step, we introduce one level of indirection, so that client
> behaviour knobs are not attached directly to the versioned data:
>
> [[[
> svn propset "danielsahlberg:no-pristines" "*" doc/
>
> cat >> ~/.subversion/config <<EOF
> [auto-props]
> src/**/*.exe = "danielsahlberg:no-pristines = *"
> [working-copy]
> omit-pristines-where-this-prop-is-set = "danielsahlberg:no-pristines"
> EOF
> ]]]
>
> and we make standard Subversion control its pristine storage in
> accordance with the config option "omit-pristines-where-this-prop-is-set".
>
> Second step, we provide server-side configuration for automating those
> config settings, as follows:
>
> [[[
> svn propset --revprop -r0 \
> "svn:server-dictated-config:config:auto-props:src/**/*.exe" \
> "danielsahlberg:no-pristines = *"
> svn propset --revprop -r0 \
> "svn:server-dictated-config:config:working-copy:omit-pristines-where-this-prop-is-set" \
> "danielsahlberg:no-pristines"
> ]]]
>
> and we make standard Subversion read config options from the
> repository's r0 revprops, and use those as default values for local
> config options.
>
> This is not a concrete proposal, just trying to make a clear explanation.
>

Sorry for bringing auto-props into the discussion. I forgot this was a
client side configuration and I realise now that I can't have that cookie
without setting up a system for distributing client side settings. I
believe this is out of scope for #525.

Agreed, those are two possible solutions and the second one would solve
both the "managing pristines centrally" and "setting auto-props config". It
also seems to me that the second option would include significantly more
code (fetching client-side settings both from the config file and from -r0
revprops as well as checking for the name of the pristine-controlling
property). I could live with just an "svn:no-pristines=*" and setting up
some script to automatically add these properties when things are committed
to the repository.

/Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Here is an approach that does *not* satisfy both sides of this argument:

[[[
svn propset "svn:no-pristines" "*" doc/

cat >> ~/.subversion/config <<EOF
[auto-props]
src/**/*.exe = "svn:no-pristines = *"
EOF
]]]

and we make standard Subversion control its pristine storage based on
looking for the versioned property "svn:no-pristines".

Here is an approach that *does* satisfy both sides of this argument. It
is indirect.

First step, we introduce one level of indirection, so that client
behaviour knobs are not attached directly to the versioned data:

[[[
svn propset "danielsahlberg:no-pristines" "*" doc/

cat >> ~/.subversion/config <<EOF
[auto-props]
src/**/*.exe = "danielsahlberg:no-pristines = *"
[working-copy]
omit-pristines-where-this-prop-is-set = "danielsahlberg:no-pristines"
EOF
]]]

and we make standard Subversion control its pristine storage in
accordance with the config option "omit-pristines-where-this-prop-is-set".

Second step, we provide server-side configuration for automating those
config settings, as follows:

[[[
svn propset --revprop -r0 \
"svn:server-dictated-config:config:auto-props:src/**/*.exe" \
"danielsahlberg:no-pristines = *"
svn propset --revprop -r0 \
"svn:server-dictated-config:config:working-copy:omit-pristines-where-this-prop-is-set" \
"danielsahlberg:no-pristines"
]]]

and we make standard Subversion read config options from the
repository's r0 revprops, and use those as default values for local
config options.

This is not a concrete proposal, just trying to make a clear explanation.
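
For illustration, the "r0 revprops as defaults for local config" layering could look like the Python sketch below. Everything here is hypothetical, including the function names and the assumption that revprop names have the form `svn:server-dictated-config:<file>:<section>:<option>` as in the example above.

```python
from typing import Dict

Config = Dict[str, Dict[str, str]]  # section name -> option name -> value

PREFIX = "svn:server-dictated-config:"


def server_defaults(revprops: Dict[str, str]) -> Config:
    """Extract default 'config' file options from r0 revprops."""
    defaults: Config = {}
    for name, value in revprops.items():
        if not name.startswith(PREFIX):
            continue  # unrelated revprop such as svn:date
        # Split into file, section, option; the option may itself contain ':'.
        parts = name[len(PREFIX):].split(":", 2)
        if len(parts) != 3 or parts[0] != "config":
            continue
        _, section, option = parts
        defaults.setdefault(section, {})[option] = value
    return defaults


def effective_config(revprops: Dict[str, str], local: Config) -> Config:
    """Server-dictated values are only defaults; local settings override them."""
    merged = {section: dict(options)
              for section, options in server_defaults(revprops).items()}
    for section, options in local.items():
        merged.setdefault(section, {}).update(options)
    return merged
```

The point of the design is that the repository only suggests values, and a user's own config file always has the last word.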

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Sahlberg <da...@gmail.com>.

On Fri, 11 Mar 2022 at 11:28, Vincent Lefevre <vincent-svn@vinc17.net> wrote:

> On 2022-03-11 10:04:36 +0000, Julian Foad wrote:
> > Daniel Sahlberg wrote:
> > > I'm taking an opposite position with regard to where this should be
> > > administered. [...] I would prefer a multi-level approach where the
> > > repository (through svn:foo properties) could suggest pristine-less WC
> >
> > I understand completely your case, but the solution you need is a way to
> > configure your client's behaviour remotely, and that is not necessarily
> > best done by Subversion versioned properties. Do you see the
> > distinction? Rather, what you need is for client configuration to be
> > managed centrally and obeyed by your clients. The server and clients
> > involved *could* be your Subversion repository server and Subversion
> > clients, but could alternatively be some other mechanism. You just need
> > some mechanism that works and is easy enough to deploy.
>
> If I understand correctly, what Daniel Sahlberg means is that the
> property would actually tell the client what to do *by default*,
> removing the need to configure the client. But I suppose that its
> use would be very uncommon (say, for a repository storing only
> big binary files, the main goal being to keep the history, but
> where "svn diff" would never be done in practice).
>

Correct!

> Having such a property on directories and/or individual files would
> be much more interesting, but in such a case, there should be more
> than 2 levels of suggestion.
>

In our case we have directories with binary blobs of documentation and I
would like to set it on that directory, but not on the directories
containing source code. We also commit compiled code (identified by file
name extension) and I would like to set it (via auto-props) on these files.

Again: I'm not suggesting that Subversion should set such settings *by
default* but provide a mechanism for the committers to set it.

Kind regards,
Daniel Sahlberg

Re: A two-part vision for Subversion and large binary objects.

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2022-03-11 10:04:36 +0000, Julian Foad wrote:
> Daniel Sahlberg wrote:
> > I'm taking an opposite position with regard to where this should be
> > administered. [...] I would prefer a multi-level approach where the
> > repository (through svn:foo properties) could suggest pristine-less WC
> 
> I understand completely your case, but the solution you need is a way to
> configure your client's behaviour remotely, and that is not necessarily
> best done by Subversion versioned properties. Do you see the
> distinction? Rather, what you need is for client configuration to be
> managed centrally and obeyed by your clients. The server and clients
> involved *could* be your Subversion repository server and Subversion
> clients, but could alternatively be some other mechanism. You just need
> some mechanism that works and is easy enough to deploy.

If I understand correctly, what Daniel Sahlberg means is that the
property would actually tell the client what to do *by default*,
removing the need to configure the client. But I suppose that its
use would be very uncommon (say, for a repository storing only
big binary files, the main goal being to keep the history, but
where "svn diff" would never be done in practice).

Having such a property on directories and/or individual files would
be much more interesting, but in such a case, there should be more
than 2 levels of suggestion.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Fri, Mar 11, 2022 at 5:16 AM Daniel Sahlberg
<da...@gmail.com> wrote:
>
> On Fri, 11 Mar 2022 at 11:04, Julian Foad <ju...@apache.org> wrote:
>>
>> Daniel Sahlberg wrote:
>> > I'm taking an opposite position with regard to where this should be
>> > administered. [...] I would prefer a multi-level approach where the
>> > repository (through svn:foo properties) could suggest pristine-less WC
>>
>> I understand completely your case, but the solution you need is a way to
>> configure your client's behaviour remotely, and that is not necessarily
>> best done by Subversion versioned properties. Do you see the
>> distinction? Rather, what you need is for client configuration to be
>> managed centrally and obeyed by your clients. The server and clients
>> involved *could* be your Subversion repository server and Subversion
>> clients, but could alternatively be some other mechanism. You just need
>> some mechanism that works and is easy enough to deploy.
>
>
> I do see that distinction and I completely agree with your analysis.
>
> My position is that svn properties is the easiest way for me to distribute this kind of client configuration (we could call it "client hints"). If there is a majority that Subversion should not provide that, then I won't stand in the way of consensus.
>
> There are a lot of other options as well to configure the clients, AD group policies probably being the most common in a corporate environment but these have a higher bar to get started.

I agree with Daniel completely ... including not wanting to stand in
the way of consensus. I think it just depends if you are more used to
supporting "users" in the corporate world vs thinking like a
super-experienced *nix hacker like Karl and Julian.

I also think that the primary use case for this feature is to offer
better handling for large binary files. And regardless of whether you
are a corporate user or an experienced hacker there is going to be
very little use for storing a second copy of those files in the
pristines. So I have always thought that a svn: property based
approach makes the most sense for distributing this information to the
clients.

I would favor making it simple for the user and if you really have
strong beliefs that the client should have full control then allow a
power-user to have options to override those defaults.

Again ... I do not want to stand in the way of consensus or alter the
MVP. Like Daniel, I am just saying let's not shut down the possibility
of this approach in the future.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Sahlberg <da...@gmail.com>.

Den fre 11 mars 2022 kl 11:04 skrev Julian Foad <ju...@apache.org>:

> Daniel Sahlberg wrote:
> > I'm taking an opposite position with regards on where this should be
> > administred. [...] I would prefer a multi-level approach where the
> > repository (through svn:foo properties) could suggest pristine-less WC
>
> I understand completely your case, but the solution you need is a way to
> configure your client's behaviour remotely, and that is not necessarily
> best done by Subversion versioned properties. Do you see the
> distinction? Rather, what you need is for client configuration to be
> managed centrally and obeyed by your clients. The server and clients
> involved *could* be your Subversion repository server and Subversion
> clients, but could alternatively be some other mechanism. You just need
> some mechanism that works and is easy enough to deploy.
>

I do see that distinction and I completely agree with your analysis.

My position is that svn properties is the easiest way for me to distribute
this kind of client configuration (we could call it "client hints"). If
there is a majority that Subversion should not provide that, then I won't
stand in the way of consensus.

There are a lot of other options as well to configure the clients, AD group
policies probably being the most common in a corporate environment but
these have a higher bar to get started.

Kind regards,
Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Daniel Sahlberg wrote:
> I'm taking an opposite position with regards on where this should be
> administred. [...] I would prefer a multi-level approach where the
> repository (through svn:foo properties) could suggest pristine-less WC

I understand completely your case, but the solution you need is a way to
configure your client's behaviour remotely, and that is not necessarily
best done by Subversion versioned properties. Do you see the
distinction? Rather, what you need is for client configuration to be
managed centrally and obeyed by your clients. The server and clients
involved *could* be your Subversion repository server and Subversion
clients, but could alternatively be some other mechanism. You just need
some mechanism that works and is easy enough to deploy.

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Sahlberg <da...@gmail.com>.

Den tors 10 mars 2022 kl 18:48 skrev Karl Fogel <kf...@red-bean.com>:

> On 10 Mar 2022, Lorenz wrote:
> >Daniel Sahlberg wrote:
> >
> >>Den tis 8 mars 2022 kl 14:17 skrev Daniel Shahaf
> >><d....@daniel.shahaf.name>:
> >>
> >>> An alternative is to require the user to let svn know before
> >>> they're
> >>> starting to edit a file, so we can create a pristine off the
> >>> on-disk
> >>> file.  This way we won't have pristineless modified files in
> >>> the first
> >>> place.
> >>>
> >>
> >>Not "require". It might be an interesting for some use-case to
> >>have "svn
> >>create-pristine-from-wc" as a manual step, but not adding this
> >>as part of
> >>the normal workflow. I have some wc's that might benefit from
> >>being
> >>pristine-less, but I'm not prepared to pay the extra cost
> >>(time-wise) of an
> >>svn:needs-locking-like step for every file I need to modify. I
> >>don't think
> >>this new command (or option) is MVP.
> >
> >maybe something like svn:needs-prestine-for-edit similar to
> >svn:needs-lock?
> >
> >Or, when finally get a file specific configuration for prestine
> >handling, that case could be included there?
>
> There's one principle I'm pretty firmly convinced about (I mean,
> of course everything is open for discussion here, I'm just saying
> where I'm starting from):
>
> Everything to do with pristines is a matter of *local
> configuration* ("configuration" interpreted broadly -- it includes
> local run-time options, as well as stuff in config files).
>
> In other words, it would be a mistake to create new svn:foo
> properties that indicate what the local pristine behavior should
> be, because the user's needs are inherently local and specific to
> that user's situation (how fast is their network, how much disk
> space do they have).  In other words, those needs are *not* about
> the file itself, but rather are solely about the constraints of
> the local (client-side) environment.
>
> Now, local configuration could look at *existing* svn:foo
> properties that serve other purposes (e.g., svn:mime-type), in
> order to make decisions about pristines, the same way local config
> can look at file size to make such decisions.  And if some
> organization wants to set their own custom non-svn:foo properties
> and have local config look at those custom properties for
> guidance, that's fine -- that's their business.
>
> But SVN should not be building in such things itself.  Pristines
> are a purely local phenomenon.  An svn:foo property whose purpose
> is to give guidance about pristines would be a directional
> mistake, IMHO.
>

I'm taking an opposite position with regard to where this should be
administered. My primary use case is with users who manage their own
computers (i.e., I have no simple way of pushing settings) but who are not
interested in configuring a lot of client-side options. I know their use
case well enough to know that they would benefit from pristine-less WCs (99%
of the work is done while connected on a fast network connection and svn
diff (et al.) is a relatively uncommon operation).

I would prefer a multi-level approach where the repository (through svn:foo
properties) could suggest pristine-less WC (even better, to have that
property on directories and on individual files) but the client could
override this suggestion (either through general config in .svn/ or through
cmdline options).

With certain repositories (like the ASF repo) this knowledge does not exist
and I would expect the property isn't set. With other repositories
(internal corporate repositories) let the administrator handle things and
powerusers could overrule.
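
To make the layering concrete, here is a minimal sketch of the precedence
I have in mind. Everything in it is hypothetical: the function name, the
three levels and the default are my own illustration, not anything that
exists in Subversion today. A value of None means "this level has no
opinion".

```python
from typing import Optional


def resolve_store_pristine(repo_hint: Optional[bool],
                           wc_config: Optional[bool],
                           cmdline: Optional[bool],
                           default: bool = True) -> bool:
    """Decide whether to store a pristine for a file.

    Levels are consulted from weakest to strongest: the repository's
    suggestion (a hypothetical svn:foo property), then the working copy
    configuration, then a command-line option. The strongest level that
    expresses an opinion wins; if none does, the default applies.
    """
    result = default
    for level in (repo_hint, wc_config, cmdline):
        if level is not None:
            result = level
    return result


if __name__ == "__main__":
    # Repository suggests "no pristine", nobody overrides it.
    print(resolve_store_pristine(False, None, None))
    # Same suggestion, but the power user asks for pristines in this WC.
    print(resolve_store_pristine(False, True, None))
```

The point of the sketch is only that the repository level is a
suggestion: any client-side setting overrides it.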

That might not be MVP, but at least let's not rule it out for the future.

Kind regards,
Daniel Sahlberg

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 10 Mar 2022, Lorenz wrote:
>Daniel Sahlberg wrote:
>
>>Den tis 8 mars 2022 kl 14:17 skrev Daniel Shahaf 
>><d....@daniel.shahaf.name>:
>>
>>> An alternative is to require the user to let svn know before 
>>> they're
>>> starting to edit a file, so we can create a pristine off the 
>>> on-disk
>>> file.  This way we won't have pristineless modified files in 
>>> the first
>>> place.
>>>
>>
>>Not "require". It might be an interesting for some use-case to 
>>have "svn
>>create-pristine-from-wc" as a manual step, but not adding this 
>>as part of
>>the normal workflow. I have some wc's that might benefit from 
>>being
>>pristine-less, but I'm not prepared to pay the extra cost 
>>(time-wise) of an
>>svn:needs-locking-like step for every file I need to modify. I 
>>don't think
>>this new command (or option) is MVP.
>
>maybe something like svn:needs-prestine-for-edit similar to
>svn:needs-lock?
>
>Or, when finally get a file specific configuration for prestine
>handling, that case could be included there?

There's one principle I'm pretty firmly convinced about (I mean, 
of course everything is open for discussion here, I'm just saying 
where I'm starting from):

Everything to do with pristines is a matter of *local 
configuration* ("configuration" interpreted broadly -- it includes 
local run-time options, as well as stuff in config files).

In other words, it would be a mistake to create new svn:foo 
properties that indicate what the local pristine behavior should 
be, because the user's needs are inherently local and specific to 
that user's situation (how fast is their network, how much disk 
space do they have).  In other words, those needs are *not* about 
the file itself, but rather are solely about the constraints of 
the local (client-side) environment.

Now, local configuration could look at *existing* svn:foo 
properties that serve other purposes (e.g., svn:mime-type), in 
order to make decisions about pristines, the same way local config 
can look at file size to make such decisions.  And if some 
organization wants to set their own custom non-svn:foo properties 
and have local config look at those custom properties for 
guidance, that's fine -- that's their business.

But SVN should not be building in such things itself.  Pristines 
are a purely local phenomenon.  An svn:foo property whose purpose 
is to give guidance about pristines would be a directional 
mistake, IMHO.
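
As a rough illustration of what I mean by local configuration looking at
file size and at existing properties, here is a sketch. The class, its
field names and the thresholds are all made up for the example; the only
real name in it is the existing svn:mime-type property.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass
class LocalPristinePolicy:
    """Hypothetical client-side policy, set by the user and not by the repo."""
    # Files larger than this get no pristine (assumed threshold).
    max_size_bytes: int = 100 * 1024 * 1024
    # MIME type prefixes for which the user wants no pristine (assumed list).
    skip_mime_prefixes: Tuple[str, ...] = ("application/octet-stream", "video/")

    def wants_pristine(self, size_bytes: int, props: Mapping[str, str]) -> bool:
        """Return True if a pristine should be kept for this file."""
        mime = props.get("svn:mime-type", "")
        if any(mime.startswith(prefix) for prefix in self.skip_mime_prefixes):
            return False
        return size_bytes <= self.max_size_bytes


if __name__ == "__main__":
    policy = LocalPristinePolicy()
    print(policy.wants_pristine(4096, {"svn:mime-type": "text/plain"}))
    print(policy.wants_pristine(4096, {"svn:mime-type": "video/mp4"}))
```

Note that the property is consulted for what it already says about the
file, while the decision itself stays entirely on the client.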

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Lorenz <lo...@yahoo.com>.

Daniel Sahlberg wrote:

>Den tis 8 mars 2022 kl 14:17 skrev Daniel Shahaf <d....@daniel.shahaf.name>:
>
>> An alternative is to require the user to let svn know before they're
>> starting to edit a file, so we can create a pristine off the on-disk
>> file.  This way we won't have pristineless modified files in the first
>> place.
>>
>
>Not "require". It might be an interesting for some use-case to have "svn
>create-pristine-from-wc" as a manual step, but not adding this as part of
>the normal workflow. I have some wc's that might benefit from being
>pristine-less, but I'm not prepared to pay the extra cost (time-wise) of an
>svn:needs-locking-like step for every file I need to modify. I don't think
>this new command (or option) is MVP.

Maybe something like svn:needs-pristine-for-edit, similar to
svn:needs-lock?

Or, when we finally get a file-specific configuration for pristine
handling, that case could be included there?
-- 

Lorenz

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Sahlberg <da...@gmail.com>.

Den tis 8 mars 2022 kl 14:17 skrev Daniel Shahaf <d....@daniel.shahaf.name>:

> An alternative is to require the user to let svn know before they're
> starting to edit a file, so we can create a pristine off the on-disk
> file.  This way we won't have pristineless modified files in the first
> place.
>

Not "require". It might be interesting for some use cases to have "svn
create-pristine-from-wc" as a manual step, but not adding this as part of
the normal workflow. I have some wc's that might benefit from being
pristine-less, but I'm not prepared to pay the extra cost (time-wise) of an
svn:needs-locking-like step for every file I need to modify. I don't think
this new command (or option) is MVP.

Kind regards,
Daniel Sahlberg

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, Mar 07, 2022 at 13:44:03 -0600:
> On 07 Mar 2022, Mark Phippard wrote:
> > > I do understand the reasons why Evgeny thought pre-fetching
> > > pristines for modified files as part of an 'update' could be a
> > > good idea.
> > 
> > My recollection of the first version of this patch, commit needed the
> > pristine and so had to fetch it before the commit happened. This may
> > have been a reason it seemed like a good idea at the time for update
> > to get the pristine.
> 
> Ah, maybe so; I didn't realize that.
> 
> If that was the motivation, then there's even less reason for 'update' to
> fetch pristines for modified files.  Having the pristine is not only
> unnecessary for the commit, in most cases having the pristine is not even
> particularly *useful* to the commit.  These types of files tend to be
> non-diffable anyway (i.e., not even binary diffable), broadly speaking and
> with occasional exceptions of course.  For example, a common such file is a
> gigantic gzipped blob.  Tiny changes in the uncompressed text will lead to a
> completely different gzipped blob.

And «update» could send a self-compressed delta anyway.

> (I suppose it might be the case that if the first change is made very late
> in the uncompressed text, then the revised gzipped blob can, under some
> real-world circumstances, actually be bit-for-bit the same as the original
> for a long initial prefix before showing any difference.  But this is a rare
> enough case that I don't think Subversion should be trying to detect it and
> support it.  We'd essentially have to incorporate the rsync rolling-checksum
> algorithm, or something like it, into our diff negotiation to even get any
> advantage.)

This use-case may be a rare one, but rsync _was_ in fact designed to
solve precisely the problem that «svn commit» of a pristineless file
needs to solve.  So, suppose we did use the rsync algorithm, would this
benefit any other use-cases other than the "first change is at the end
of the file" use-case you describe here?  Is it faster to commit a file
by sending a self-delta of it or by rsync'ing it?

Furthermore, the user may be able to deliberately create the huge file
in a way that makes it rsync-friendly: for instance, `svnadmin dump`
emits hashes in sorted order, which has the side-effect of making dump
files rsync-friendly.  For gzip files there is «gzip --rsyncable».

None of this is needed for the MVP, of course, but I do think the basic
principle of using rsync is in fact sound.
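
For readers who haven't looked at it, here is a small sketch of the
rolling weak checksum idea that rsync is built on. This is my own
simplified illustration, not rsync's or Subversion's code: the function
names are hypothetical, and a direct byte comparison stands in for the
strong hash that a real network protocol would use to confirm a match.

```python
MOD = 1 << 16


def weak_checksum(block: bytes):
    """Return the (a, b) pair of the weak checksum for one block."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def find_matching_blocks(old: bytes, new: bytes, block_size: int):
    """Return (offset_in_new, offset_in_old) for old blocks found in new."""
    table = {}
    for idx in range(0, len(old) - block_size + 1, block_size):
        table.setdefault(weak_checksum(old[idx:idx + block_size]), []).append(idx)

    matches = []
    if len(new) < block_size:
        return matches
    pos = 0
    a, b = weak_checksum(new[:block_size])
    while True:
        hit = None
        for idx in table.get((a, b), ()):
            # Stand-in for the strong checksum comparison.
            if old[idx:idx + block_size] == new[pos:pos + block_size]:
                hit = idx
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += block_size
            if pos + block_size > len(new):
                break
            a, b = weak_checksum(new[pos:pos + block_size])
        else:
            if pos + block_size >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + block_size], block_size)
            pos += 1
    return matches


if __name__ == "__main__":
    old = b"abcdefghijklmnop"
    new = b"XYabcdefghZZmnop"
    print(find_matching_blocks(old, new, 4))  # [(2, 0), (6, 4), (12, 12)]
```

The sender only has to transmit the bytes of the new file that are not
covered by a match, which is why the approach does not need a local
pristine.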

An alternative is to require the user to let svn know before they're
starting to edit a file, so we can create a pristine off the on-disk
file.  This way we won't have pristineless modified files in the first
place.

> And in the absence of fancy cross-network common-prefix detection code that
> we're not going to write, this would just be cost-shifting anyway.  Whatever
> commit-time improvement one would gain from having the pristine locally
> would be offset by the extra time spent fetching the pristine to make that
> commit-time improvement possible.

What assumptions is this conclusion valid under?  It seems to me that
this conclusion assumes, at least, that the uplink and downlink
bandwidths are equal and that the pristine will be needed exactly once
(i.e., a hydrate-commit-dehydrate sequence).

I'm not objecting to making assumptions; we aren't going to address all
use-cases in 1.15.  I'm just asking that we make our assumptions explicit.
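
To show how much the answer depends on those assumptions, here is a
back-of-the-envelope model. It is deliberately crude and entirely my own:
the function and parameter names are hypothetical, bandwidths are in bytes
per second, and it assumes one fetched pristine can serve a given number
of commits, each of which sends a delta of the same size.

```python
def commit_cost_seconds(file_bytes: float,
                        delta_bytes: float,
                        down_bps: float,
                        up_bps: float,
                        commits_per_fetch: int = 1):
    """Return (with_pristine, without_pristine) transfer times in seconds.

    with_pristine:    download the pristine once, then upload a delta
                      against it for each commit.
    without_pristine: upload the whole file (a self-delta) for each commit.
    """
    with_pristine = file_bytes / down_bps + commits_per_fetch * delta_bytes / up_bps
    without_pristine = commits_per_fetch * file_bytes / up_bps
    return with_pristine, without_pristine


if __name__ == "__main__":
    gb = 1e9
    # Symmetric link, poorly diffable file (delta as large as the file):
    # fetching the pristine first only adds time.
    print(commit_cost_seconds(gb, gb, 12.5e6, 12.5e6))
    # Fast downlink, slow uplink, delta is 10% of the file:
    # fetching the pristine first is a large net win.
    print(commit_cost_seconds(gb, 0.1 * gb, 100e6, 1e6))
```

With equal bandwidths and a delta as large as the file, the model agrees
with Karl's "cost-shifting" conclusion; with an asymmetric link and a
diffable file it does not.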

Cheers,

Daniel

> So... yeah.  Let's not do that :-).
> 
> Best regards,
> -Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 07 Mar 2022, Mark Phippard wrote:
>> I do understand the reasons why Evgeny thought pre-fetching
>> pristines for modified files as part of an 'update' could be a
>> good idea.
>
>My recollection of the first version of this patch, commit needed 
>the
>pristine and so had to fetch it before the commit happened. This 
>may
>have been a reason it seemed like a good idea at the time for 
>update
>to get the pristine.

Ah, maybe so; I didn't realize that.

If that was the motivation, then there's even less reason for 
'update' to fetch pristines for modified files.  Having the 
pristine is not only unnecessary for the commit, in most cases 
having the pristine is not even particularly *useful* to the 
commit.  These types of files tend to be non-diffable anyway 
(i.e., not even binary diffable), broadly speaking and with 
occasional exceptions of course.  For example, a common such file 
is a gigantic gzipped blob.  Tiny changes in the uncompressed text 
will lead to a completely different gzipped blob.

(I suppose it might be the case that if the first change is made 
very late in the uncompressed text, then the revised gzipped blob 
can, under some real-world circumstances, actually be bit-for-bit 
the same as the original for a long initial prefix before showing 
any difference.  But this is a rare enough case that I don't think 
Subversion should be trying to detect it and support it.  We'd 
essentially have to incorporate the rsync rolling-checksum 
algorithm, or something like it, into our diff negotiation to even 
get any advantage.)
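
If anyone wants to see the effect for themselves, here is a quick
experiment using Python's zlib module (the same DEFLATE compression that
gzip uses). The helper names and the synthetic test data are my own
invention; the script just measures how many leading bytes two compressed
outputs have in common when a single byte is changed at the start versus
at the end of the uncompressed input.

```python
import random
import zlib


def common_prefix_len(x: bytes, y: bytes) -> int:
    """Number of leading bytes that x and y share."""
    n = 0
    for p, q in zip(x, y):
        if p != q:
            break
        n += 1
    return n


def make_text(seed: int = 0, words: int = 120_000) -> bytes:
    """Deterministic, moderately compressible synthetic text."""
    rng = random.Random(seed)
    vocab = [("w%04d" % i).encode() for i in range(2000)]
    return b" ".join(rng.choice(vocab) for _ in range(words))


def compare_prefixes():
    """Return (prefix_after_early_change, prefix_after_late_change)."""
    text = make_text()
    early = b"X" + text[1:]   # change the very first byte
    late = text[:-1] + b"X"   # change the very last byte
    base = zlib.compress(text)
    return (common_prefix_len(base, zlib.compress(early)),
            common_prefix_len(base, zlib.compress(late)))


if __name__ == "__main__":
    early_prefix, late_prefix = compare_prefixes()
    print("shared prefix after early change:", early_prefix)
    print("shared prefix after late change: ", late_prefix)
```

An early change should leave almost nothing in common, while a change at
the very end should leave a long shared prefix, which is the rare case
described above.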

And in the absence of fancy cross-network common-prefix detection 
code that we're not going to write, this would just be 
cost-shifting anyway.  Whatever commit-time improvement one would 
gain from having the pristine locally would be offset by the extra 
time spent fetching the pristine to make that commit-time 
improvement possible.

So... yeah.  Let's not do that :-).

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Sun, Mar 6, 2022 at 11:19 PM Karl Fogel <kf...@red-bean.com> wrote:

[snipped]

Agree with everything you have said.

> I do understand the reasons why Evgeny thought pre-fetching
> pristines for modified files as part of an 'update' could be a
> good idea.

My recollection is that in the first version of this patch, commit
needed the pristine and so had to fetch it before the commit happened.
This may have been a reason it seemed like a good idea at the time for
update to get the pristine.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 04 Mar 2022, Julian Foad wrote:
>I had a talk with Karl about this, and now I understand the 
>concern much better.
>
>(Karl, please correct anything I misrepresent.)

You've described it well, Julian.  Thank you (and thank you also 
for your patience in explaining to me the current State Of The 
Onion in a phone call, when I was still behind on reading dev@ 
posts -- I'm caught up now).

The one thing I would add to your summary below is that the 
concern on the client side is not just about wasted time (that is, 
the time spent fetching pristines for files that won't, in the 
end, actually need pristines locally).

The concern is also local *space*.  It's not unusual for one of 
these working copies to bring a local disk to within a few 
enormous files of full disk usage -- in other words, to be in a 
situation where fetching a certain number of pristines could 
result in the disk running out of space.  So if one has modified N 
of the large versioned files, and then an update brings down N 
correspondingly large pristines, well, hilarity could ensue :-).

But even beyond my experience with particular use cases, I think 
we should aim for the simplicity of a principle here:

Principle: When a file is checked out without its pristine, then 
SVN should never fetch that pristine unless we actually need to.

(Naturally, this principle applies, via the distributive property, 
to all the files in a fully pristine-less working copy.  Since in 
the future we may offer UI to allow working copies in which some 
files are checked out with pristine and some without, I am being 
careful to articulate the principle here as being about files 
rather than about working copies.)

The justification for this principle is that there's presumably a 
*reason* why the user requested that there be no pristine for that 
file.  Whatever that reason is, we have no reason to think we know 
better than the user does. 

The most likely reason is that the file is huge and the user 
doesn't want to pay the disk-space cost, nor the network-time cost 
in the case of updates for which the file hasn't changed in the 
repository.  But maybe the reason is something else.  Who knows? 
Not our business.  The user told SVN what they wanted, and SVN 
should do that thing.

Now, if the user runs an operation that requires a pristine, 
that's different -- then they're effectively notifying us that 
they're changing their decision.  We should obey the user in that 
case too.  It's just that it would be bad form for us to go 
fetching a pristine when a) the user already said they don't want 
it and b) SVN has no identifiable need for it in this operation.
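
Expressed as a sketch, the principle might look like the function below.
This is only an illustration: the names are hypothetical, and the list of
operations that need a pristine (diff and revert on a locally modified
file, and an update that has to merge incoming changes into a locally
modified file) is my assumption based on this thread, not a statement of
what the code does today.

```python
# Operations assumed to need the pristine of a locally modified file.
OPERATIONS_NEEDING_PRISTINE = {"diff", "revert"}


def must_fetch_pristine(operation: str,
                        has_pristine: bool,
                        locally_modified: bool,
                        changed_in_repo: bool) -> bool:
    """Return True only if this operation cannot proceed without the pristine."""
    if has_pristine:
        return False
    if operation in OPERATIONS_NEEDING_PRISTINE:
        # An unmodified file has nothing to diff or revert.
        return locally_modified
    if operation == "update":
        # Assumption: a base text is needed only to merge incoming
        # changes into a locally modified file.
        return locally_modified and changed_in_repo
    # Everything else, including commit, sends or reads the working file.
    return False


if __name__ == "__main__":
    # The case under discussion: large file modified locally, unchanged
    # in the repository, user runs 'svn update'.
    print(must_fetch_pristine("update", False, True, False))
    # The user explicitly asks for something that needs the pristine.
    print(must_fetch_pristine("revert", False, True, False))
```

The first call is the scenario we care about, and under the principle it
must never trigger a fetch.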

I do understand the reasons why Evgeny thought pre-fetching 
pristines for modified files as part of an 'update' could be a 
good idea.  There would surely be _some_ occasions where a user 
would be pleasantly surprised to find that they have that pristine 
locally just when they need it.  But in the end, I believe that

a) In the most common use cases, it's probably not what the user 
wants anyway;

b) The failure mode of unnecessary fetching and storing is much 
worse than the failure mode of not having fetched a pristine that 
someone might turn out to want (there are workarounds for the 
latter);

c) It's generally better if we have a simple and comprehensible 
principle, like the one I articulated above.

Best regards,
-Karl

>He shares the view that it would be unacceptable for 'svn update' 
>to
>fetch pristines of files that have become locally modified since 
>the
>previous fetch opportunity, but that are not actually being 
>updated by
>this update.
>
>In his use cases a developer locally modifies some large 
>files. The
>developer also modifies some small files (such as 'readme' files
>describing the large files). The developer doesn't need to diff 
>or
>revert the large files, and so chooses the checkout mode which 
>doesn't
>keep the pristines initially.
>
>Before committing, the developer runs 'update', expecting to 
>fetch any
>remote changes to the small files (and not large files, not in 
>this
>case), expecting it to be quick, and then the developer continues 
>work
>and eventually commits.
>
>The time taken to fetch the pristines of the large, modified 
>files would
>be long (for example, ten minutes). Taking a long time for the 
>commit is
>acceptable because the commit is the end of the work flow (and 
>the
>developer can go away or move on to something else while it 
>proceeds).
>The concern is that taking a long time at the update stage would 
>be too disruptive.
>
>It wouldn't be a problem for an operation that really needs the
>pristines taking a long time. (Revert, for example.) The 
>perception is
>that update doesn't really need them. That is, while it obviously 
>needs
>in principle to fetch the new pristines of the files that need 
>updating
>to a new version from the server (or fetch a delta and so be able 
>to
>generate the pristine), it doesn't, in principle, need pristines 
>of
>files that it isn't going to update. In this use case, it isn't 
>going to
>update the large, locally modified files. And fetching their 
>pristines
>wouldn't massively benefit the commit either, because they are 
>poorly
>diffable kinds of files. So it is wasted time.
>
>If the implementation currently requires these pristines, that 
>would
>seem to be an implementation detail and we would seek to change 
>that.
>
>So my task now is to investigate for any way we can eliminate or
>optimise the unnecessary fetching, at least in this specific 
>case.
>
>Filed as https://subversion.apache.org/issue/4892 .
>
>I will investigate this issue next week.
>
>- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Fri, Mar 4, 2022 at 3:52 PM Julian Foad <ju...@apache.org> wrote:
>
> > Mark Phippard wrote:
> >> [...] For an update, I think it is unexpected and undesirable. [...]
>
> I had a talk with Karl about this, and now I understand the concern much better.
>
> (Karl, please correct anything I misrepresent.)
>
> He shares the view that it would be unacceptable for 'svn update' to
> fetch pristines of files that have become locally modified since the
> previous fetch opportunity, but that are not actually being updated by
> this update.
>
> In his use cases a developer locally modifies some large files. The
> developer also modifies some small files (such as 'readme' files
> describing the large files). The developer doesn't need to diff or
> revert the large files, and so chooses the checkout mode which doesn't
> keep the pristines initially.
>
> Before committing, the developer runs 'update', expecting to fetch any
> remote changes to the small files (and not large files, not in this
> case), expecting it to be quick, and then the developer continues work
> and eventually commits.
>
> The time taken to fetch the pristines of the large, modified files would
> be long (for example, ten minutes). Taking a long time for the commit is
> acceptable because the commit is the end of the work flow (and the
> developer can go away or move on to something else while it proceeds).
> The concern is that taking a long time at the update stage would be too disruptive.
>
> It wouldn't be a problem for an operation that really needs the
> pristines taking a long time. (Revert, for example.) The perception is
> that update doesn't really need them. That is, while it obviously needs
> in principle to fetch the new pristines of the files that need updating
> to a new version from the server (or fetch a delta and so be able to
> generate the pristine), it doesn't, in principle, need pristines of
> files that it isn't going to update. In this use case, it isn't going to
> update the large, locally modified files. And fetching their pristines
> wouldn't massively benefit the commit either, because they are poorly
> diffable kinds of files. So it is wasted time.
>
> If the implementation currently requires these pristines, that would
> seem to be an implementation detail and we would seek to change that.
>
> So my task now is to investigate for any way we can eliminate or
> optimise the unnecessary fetching, at least in this specific case.
>
> Filed as https://subversion.apache.org/issue/4892 .
>
> I will investigate this issue next week.

Thanks Julian. I think your summary here matches what I was
saying/thinking well.

Hope it works out.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

> Mark Phippard wrote:
>> [...] For an update, I think it is unexpected and undesirable. [...]

I had a talk with Karl about this, and now I understand the concern much better.

(Karl, please correct anything I misrepresent.)

He shares the view that it would be unacceptable for 'svn update' to
fetch pristines of files that have become locally modified since the
previous fetch opportunity, but that are not actually being updated by
this update.

In his use cases a developer locally modifies some large files. The
developer also modifies some small files (such as 'readme' files
describing the large files). The developer doesn't need to diff or
revert the large files, and so chooses the checkout mode which doesn't
keep the pristines initially.

Before committing, the developer runs 'update', expecting to fetch any
remote changes to the small files (and not large files, not in this
case), expecting it to be quick, and then the developer continues work
and eventually commits.

The time taken to fetch the pristines of the large, modified files would
be long (for example, ten minutes). Taking a long time for the commit is
acceptable because the commit is the end of the work flow (and the
developer can go away or move on to something else while it proceeds).
The concern is that taking a long time at the update stage would be too disruptive.

It wouldn't be a problem for an operation that really needs the
pristines taking a long time. (Revert, for example.) The perception is
that update doesn't really need them. That is, while it obviously needs
in principle to fetch the new pristines of the files that need updating
to a new version from the server (or fetch a delta and so be able to
generate the pristine), it doesn't, in principle, need pristines of
files that it isn't going to update. In this use case, it isn't going to
update the large, locally modified files. And fetching their pristines
wouldn't massively benefit the commit either, because they are poorly
diffable kinds of files. So it is wasted time.

If the implementation currently requires these pristines, that would
seem to be an implementation detail and we would seek to change that.

So my task now is to investigate whether there is any way we can
eliminate or optimise the unnecessary fetching, at least in this
specific case.

Filed as https://subversion.apache.org/issue/4892 .

I will investigate this issue next week.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Mark Phippard wrote:
> That comment specifically talks about diff. [...] For an update, I
> think it is unexpected and undesirable. [...]

You're right, the comment I pointed to doesn't do anything to justify
why 'update' should fetch it. And I agree it would be better if it
didn't. Maybe we will be able to optimise that case. I don't know.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Tue, Mar 1, 2022 at 10:34 AM Julian Foad <ju...@apache.org> wrote:
>
> On Feb 18 2022, Mark Phippard wrote:
> >> [It fetches and stores pristines of modified files;] it doesn't mean
> >> "store no pristines" in that WC.
> >
> > I am curious what Karl thinks given that he is living this scenario
> > today and wants the feature. I would think that having update create
> > pristines for any modified file taints its usefulness. That said, it
> > is probably still better than what they have today and if the user is
> > on a fast network and disk space is not too big of an issue it might
> > not matter too much. I personally think this is the biggest issue to
> > solve though, more so than selectively choosing pristines for
> > different files. I think the feature just really does not "work as
> > advertised" if it is going to behave this way.
>
> Hello, Mark. Maybe Karl will yet answer, but I didn't want to leave this
> hanging any longer.
>
> This design was anticipated as far back as a 2006-06-09 comment on #525
> by Oswald Buddenhagen [1], where it is described as one of the
> possibilities among variations and alternatives. I'm not saying that
> justifies choosing it as the best solution, just that it's not arriving
> now from off the radar.

That comment specifically talks about diff. I think it is entirely
reasonable that for diff the feature works the way it does (fetch and
keep the pristine).

For an update, I think it is unexpected and undesirable. At least if
the HEAD revision of the file on the server is still the same as what
I had in my WC.

> We've already discussed how there are certainly scenarios where it won't
> be greatly helpful as well as scenarios where it will, and several
> people seem to think there are enough of the latter.
>
> Maybe, don't knock it till you've tried it?

I am really not knocking the overall feature. I am just saying that in
the scenario I described there is no way I would expect svn up to
fetch the pristines for files just because I have local mods. I think
for users with really large files ... which I assume are the main
target user ... it will make the feature less useful than it would be
if this behavior did not exist.

I am not this user. I am just projecting what I think they would want.
I was hoping Karl might chime in and/or interview his users about what
they might think. I personally think finding a solution to this would
be valuable.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

On Feb 18 2022, Mark Phippard wrote:
>> [It fetches and stores pristines of modified files;] it doesn't mean
>> "store no pristines" in that WC.
> 
> I am curious what Karl thinks given that he is living this scenario
> today and wants the feature. I would think that having update create
> pristines for any modified file taints its usefulness. That said, it
> is probably still better than what they have today and if the user is
> on a fast network and disk space is not too big of an issue it might
> not matter too much. I personally think this is the biggest issue to
> solve though, more so than selectively choosing pristines for
> different files. I think the feature just really does not "work as
> advertised" if it is going to behave this way.

Hello, Mark. Maybe Karl will yet answer, but I didn't want to leave this
hanging any longer.

This design was anticipated as far back as a 2006-06-09 comment on #525
by Oswald Buddenhagen [1], where it is described as one of the
possibilities among variations and alternatives. I'm not saying that
justifies choosing it as the best solution, just that it's not arriving
now from off the radar.

We've already discussed how there are certainly scenarios where it won't
be greatly helpful as well as scenarios where it will, and several
people seem to think there are enough of the latter.

Maybe, don't knock it till you've tried it?

As for "as advertised", then surely the take-away message is we need to
be careful how we describe it; never call it "without pristines".

- Julian

[1] https://issues.apache.org/jira/browse/SVN-525?focusedCommentId=14911121&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14911121

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Fri, Feb 18, 2022 at 3:21 PM Julian Foad <ju...@apache.org> wrote:
>
> Karl Fogel wrote:
> > Is the above happening in MVP?
>
> Yes. I was describing what Evgeny created last year in the 'pristines-on-demand' branch.
>
> > I ask because my understanding of
> > MVP was that it's not doing this opportunistic fetching/discarding
> > of bases, but rather that it's a simple per-WC setting that means
> > "store no pristines in this WC" and that's that.
>
> MVP is what Evgeny created, but selectively turned on, or not, per WC. When turned on, it doesn't mean "store no pristines" in that WC.

I am curious what Karl thinks given that he is living this scenario
today and wants the feature. I would think that having update create
pristines for any modified file taints its usefulness. That said, it
is probably still better than what they have today and if the user is
on a fast network and disk space is not too big of an issue it might
not matter too much. I personally think this is the biggest issue to
solve though, more so than selectively choosing pristines for
different files. I think the feature just really does not "work as
advertised" if it is going to behave this way.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Karl Fogel wrote: 
> Is the above happening in MVP? 

Yes. I was describing what Evgeny created last year in the 'pristines-on-demand' branch.

> I ask because my understanding of 
> MVP was that it's not doing this opportunistic fetching/discarding 
> of bases, but rather that it's a simple per-WC setting that means 
> "store no pristines in this WC" and that's that.

MVP is what Evgeny created, but selectively turned on, or not, per WC. When turned on, it doesn't mean "store no pristines" in that WC.

I'm thinking two things would make the explanation more accessible.

1. Docs: the release notes are a good start (thanks Nathan) but we need something elsewhere too (svn help?).

2. Feedback: svn should print progress notifications (maybe gated on '--verbose'), that make clear when and what it is doing with the pristines.

Number 2 would make a great volunteer contribution if anyone's willing to dip in to the branch code. It's just a matter of extending our existing notifier callback and making the hydrate/dehydrate call back to it.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 18 Feb 2022, Julian Foad wrote:
>To understand, we need to recap that this design is based around a
>simple invariant: whenever a file is seen to be locally modified, at the
>next convenient opportunity we will download its base; and when seen to
>be not-modified we will discard its base. It is not a
>fetch-at-point-of-access design.

This seems like a good principle for the long run (and 
well-articulated above, thank you!).

Is the above happening in MVP?  I ask because my understanding of 
MVP was that it's not doing this opportunistic fetching/discarding 
of bases, but rather that it's a simple per-WC setting that means 
"store no pristines in this WC" and that's that.

That means it would be up to the user to use the available 
workarounds if they need to do things (like local diff) that would 
require a pristine.  Fortunately those workarounds are easy: they 
just involve copying a file sometimes before you start working on 
it :-).

Just to be super clear: my question here is solely about MVP --
about the first released version of a usable 525-enabled 
Subversion -- not about the longer term plans, which I agree are 
excellent and will make the feature even better.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Mark Phippard wrote:
>> Update starts by hydrating. That means it WILL download any missing
>> pristines of modified files, regardless of whether any newer revision
>> will be found.
> 
> Does the possibility exist to optimize this at all? [...]

To understand, we need to recap that this design is based around a
simple invariant: whenever a file is seen to be locally modified, at the
next convenient opportunity we will download its base; and when seen to
be not-modified we will discard its base. It is not a
fetch-at-point-of-access design.

This design uses a "wrapper" structure, localising the hydration and
dehydration steps at the top level of subcommands, outside the command
logic, not inside the command logic. It does not know which text bases
the command will need, and instead works on the principle that it will
fetch all that might be needed. Evgeny and others suggested that
localizing the network access to the start of each subcommand was better
(for client software and users) than introducing the possibility for
network access to be required at arbitrary points inside the command logic.
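
As an illustration only (this is a toy model, not code from the branch), the wrapper structure described above might be sketched as follows. The names ToyWC, FileState, with_text_bases and toy_update are hypothetical, and the sketch assumes a file is judged modified by comparing a checksum of its working content with a recorded base checksum.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FileState:
    working: bytes                 # working-file content
    base_sha1: str                 # checksum of the base, always recorded
    base: Optional[bytes] = None   # stored text base, or None when dehydrated


@dataclass
class ToyWC:
    repo: Dict[str, bytes]                      # stand-in for the server
    files: Dict[str, FileState] = field(default_factory=dict)
    downloads: List[str] = field(default_factory=list)

    def is_modified(self, path: str) -> bool:
        st = self.files[path]
        return hashlib.sha1(st.working).hexdigest() != st.base_sha1

    def hydrate(self) -> None:
        """Fetch the missing base of every locally modified file."""
        for path, st in self.files.items():
            if st.base is None and self.is_modified(path):
                st.base = self.repo[path]
                self.downloads.append(path)

    def dehydrate(self) -> None:
        """Discard the stored base of every unmodified file."""
        for path, st in self.files.items():
            if st.base is not None and not self.is_modified(path):
                st.base = None


def with_text_bases(command: Callable[[ToyWC], None]) -> Callable[[ToyWC], None]:
    """Wrap a subcommand so that all fetching happens up front."""
    def wrapped(wc: ToyWC) -> None:
        wc.hydrate()          # the only place the "network" is touched
        try:
            command(wc)       # command logic knows nothing about fetching
        finally:
            wc.dehydrate()
    return wrapped


@with_text_bases
def toy_update(wc: ToyWC) -> None:
    pass  # in this example the server has no newer revisions


repo = {"big.bin": b"v1", "notes.txt": b"hello"}
wc = ToyWC(repo=repo)
for path, text in repo.items():
    wc.files[path] = FileState(working=text,
                               base_sha1=hashlib.sha1(text).hexdigest())

wc.files["big.bin"].working = b"v1 edited"   # local modification
toy_update(wc)
print(wc.downloads)                          # ['big.bin']
```

In this model the wrapper cannot tell that the update had nothing to do for big.bin, so it downloads the base anyway. That is the behaviour questioned in this thread, and it follows from keeping the fetch outside the command logic.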

The rationale is that in the design use cases, fetching is acceptably
cheap both in network speed and in the availability of sufficient
storage space for text bases for the files that are locally modified.

Possibilities for optimisation may exist but the kind of optimisation
you are asking about would require a different design, one in which the
fetcher knows what is really needed by the current operation, so one
where the fetch is pushed down in the logic to nearer the point of
access. There is limited scope for it in this design.

Personally I would like us to explore the fetch-at-point-of-access
design alternative. While introducing one kind of complexity (network
access requested at arbitrary points during a command) I feel it would
reduce the kind of complexity we're discussing here (unnecessary fetches
and consequent attempts at optimisation). But that's not what we're
exploring right now.
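
For contrast, a minimal sketch of the fetch-at-point-of-access idea, again purely illustrative: LazyBase and toy_update_lazy are hypothetical names, and the assumption that an update needs the base only when the server has a newer revision is a simplification made for the example.

```python
from typing import Callable, Optional


class LazyBase:
    """Text base that is downloaded only when a command actually reads it."""

    def __init__(self, fetch: Callable[[], bytes]) -> None:
        self._fetch = fetch
        self._content: Optional[bytes] = None
        self.fetch_count = 0

    def read(self) -> bytes:
        if self._content is None:
            self._content = self._fetch()   # network access happens here
            self.fetch_count += 1
        return self._content


def toy_update_lazy(base: LazyBase, wc_revision: int, head_revision: int) -> str:
    """Update one locally modified file, reading the base only if needed."""
    if head_revision == wc_revision:
        return "unchanged"      # nothing incoming, so the base is never read
    base.read()                 # needed to merge local mods with the change
    return "merged"


base = LazyBase(lambda: b"v1")
print(toy_update_lazy(base, 10, 10), base.fetch_count)
print(toy_update_lazy(base, 10, 11), base.fetch_count)
```

The trade-off is visible in read(): the download can now happen at any point inside the command logic, which is the kind of complexity the wrapper design avoids.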

I asked about this in this thread a few weeks ago; you could look there
for further discussion. (I tried to dig up a link but am having trouble
finding my own message in the archives.)

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Thu, Feb 17, 2022 at 5:52 PM Julian Foad <ju...@apache.org> wrote:
>
> Mark Phippard wrote:
> >> | update/switch | Always | Always + Hydrate |
> >
> >Can you expand on this one a bit? I presume what you mean is if you
> >have local mods to a file and run update/switch and there is a newer
> >revision of the file in the repository it will hydrate that new
> >version? But otherwise, I assume just running update does not create
> >pristines.
>
> Let's delete the word "Always" from both columns.
>
> Update starts by hydrating. That means it WILL download any missing pristines of modified files, regardless of whether any newer revision will be found.

Does the possibility exist to optimize this at all? Say I am working
on a large binary that I have edited. My colleague checks in a
different file that I want to look at so I run update. Now SVN is
going to unnecessarily download another copy of that large binary.

The WC knows the revision, I do not see why it would download the file
from the server on update if it is not a newer version.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Mark Phippard wrote:
>> | update/switch | Always | Always + Hydrate |
>
>Can you expand on this one a bit? I presume what you mean is if you
>have local mods to a file and run update/switch and there is a newer
>revision of the file in the repository it will hydrate that new
>version? But otherwise, I assume just running update does not create
>pristines.

Let's delete the word "Always" from both columns.

Update starts by hydrating. That means it WILL download any missing pristines of modified files, regardless of whether any newer revision will be found.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Thu, Feb 17, 2022 at 5:09 PM Julian Foad <ju...@apache.org> wrote:
>
> Awesome, Nathan!

I agree.

> | update/switch | Always | Always + Hydrate |

Can you expand on this one a bit? I presume what you mean is if you
have local mods to a file and run update/switch and there is a newer
revision of the file in the repository it will hydrate that new
version? But otherwise, I assume just running update does not create
pristines.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Awesome, Nathan! I was going to say this is clearly a priority. Thanks
so much for writing that. It is so much easier to iterate on it now you
have begun. At first glance there is not much I would add or change.

Not sure about the word "bare"; for now I'll read it as a place-holder
for whatever term we agree on.

Suggested edits below.

Nathan Hartman wrote:
[...]
> An initial, very rough draft of what might go in the 1.15 release
> notes, at least as I see this feature. Obviously, let me know if I'm
> totally off here!
> 
> New Client Features -> Bare Working Copies
> 
> All Subversion working copies require extra storage space in addition
> to the size of the checked out files.
> 
> By default, the total storage space required is slightly more than
> double the size of the checked out files. Subversion uses most of that
> extra space to cache each file's BASE revision so that many operations
> can work faster and offline.

"so that operations such as diff and revert can work offline, and
commit can send just the modified portions of a file to the repository
server rather than the whole file. This optimises the speed and
availability of these operations, on the assumption that network
throughput to the server is often a bottleneck."

> Starting in 1.15, users can check out a bare working copy to cut the
> storage requirement by up to 50%. Instead of caching the BASE revision
> of all files all the time, Subversion will only fetch and cache those
> of individual files when needed, and will eliminate them when no
> longer needed. The space savings come at a tradeoff of requiring a
> connection to the repository for more operations as compared to a
> normal working copy and may, depending on network speeds and file
> sizes, introduce a perceptible delay when a BASE file is downloaded.
> 
> This feature is motivated by use cases involving very large versioned
> files that change infrequently, where keeping the cached BASE copy
> wastes space and provides little or no benefit. This feature may also
> be useful in other scenarios, such as where a very fast connection to
> the server is available, the repository is local, available storage
> space is very limited, etc.
> 
> To check out a bare working copy:
> 
> $ svn checkout --foo --bar $REPO $WC
> 
> The command to check out a normal working copy is unchanged.

The following table lists the Subversion commands that behave
differently in a bare working copy. For each command, it shows the
difference in how that command accesses the repository.

> +----------+-------------------------------+
> |          | Working Copy Type             |
> +----------+---------------+---------------+
> | Command  | Normal        | Bare          |
> +----------+---------------+---------------+

| cat (BASE) | No | Hydrate |
| commit | Send-Delta | Send-Full |
| conflict resolving (resolve/merge/up/sw) | Sometimes | Sometimes (...) |
| diff (BASE) | No | Hydrate |
| revert | No | Hydrate |
| update/switch | Always | Always + Hydrate |

> Legend:

* Hydrate: this operation downloads and keeps the file's base revision,
for each file that has a local content modification ('svn status' shows
'M' in the 1st column) and its base is not already stored in the working
copy [1][2].

* Send-Delta: sends just the locally modified parts of each file's content.

* Send-Full: sends the complete content of each locally modified file.

* No: does not contact the server.

* Always: always contacts the server.

Once downloaded, Subversion keeps a file's base locally cached in the
working copy, so that further operations on the file will not download
the base from the repository again. It keeps the base in this way until
one of these operations either restores the file to an unmodified state
or detects that the file is no longer modified. For example, "commit"
and "revert" will immediately discard the base of each file they
operated on, because that file will no longer be locally modified,
whereas "diff" will discard the base only if it finds there are no differences.

[1] At the beginning of a given operation, Subversion will download
missing bases of *at least* the files that this particular operation
will use. It may download those of other files too, that this particular
operation will not use. For example, in the initial implementation of
this feature, Subversion considers all potential files in the smallest subtree
that spans all the target files of the operation. The details of this
behaviour are subject to change before and after the feature is released.
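
To make footnote [1] concrete, here is a small sketch of how "the smallest subtree that spans all the target files" could be computed and used to select files for hydration. It is an assumption-laden illustration, not the branch's code: spanning_subtree and hydration_candidates are hypothetical names, and all paths are assumed to be absolute POSIX-style file paths.

```python
import posixpath


def spanning_subtree(targets):
    """Return the deepest directory that contains every target file."""
    return posixpath.commonpath([posixpath.dirname(t) for t in targets])


def hydration_candidates(targets, modified_files):
    """Modified files inside the spanning subtree, whether targeted or not."""
    root = spanning_subtree(targets)
    return sorted(p for p in modified_files
                  if posixpath.commonpath([root, p]) == root)


targets = ["/wc/src/a/x.c", "/wc/src/b/y.c"]
modified = ["/wc/src/a/x.c", "/wc/src/c/big.bin", "/wc/doc/manual.pdf"]

print(spanning_subtree(targets))                # /wc/src
print(hydration_candidates(targets, modified))  # ['/wc/src/a/x.c', '/wc/src/c/big.bin']
```

Here /wc/src/c/big.bin is not a target of the operation, yet it lies inside the spanning subtree and is modified, so in this model its base would be downloaded too.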

[2] In evaluating differences between a file's working text and its base
text, Subversion takes into account the "EOL style" and "keywords"
settings. (See the 'svn:eol-style' and 'svn:keywords' properties.)
Just as 'svn status' does not show 'M' in the first column for such
differences, neither will these cause the base to be downloaded from the repository.
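
A rough sketch of the idea in footnote [2], under stated assumptions: the working text is converted to a normalised form (LF line endings, keywords contracted) before being compared, by checksum, with the recorded base. The functions to_repository_form and is_modified are hypothetical, and the normalisation is far simpler than what Subversion actually does.

```python
import hashlib
import re


def to_repository_form(text: bytes, keywords=("Id",)) -> bytes:
    """Normalise line endings to LF and contract expanded keywords."""
    text = text.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    for kw in keywords:
        name = kw.encode()
        # "$Id: anything $" becomes "$Id$"
        pattern = rb"\$" + re.escape(name) + rb":[^$\n]*\$"
        text = re.sub(pattern, b"$" + name + b"$", text)
    return text


def is_modified(working: bytes, base_sha1: str) -> bool:
    """Compare the normalised working text with the recorded base checksum."""
    return hashlib.sha1(to_repository_form(working)).hexdigest() != base_sha1


base = b"line1\n$Id$\n"
base_sha1 = hashlib.sha1(base).hexdigest()

expanded = b"line1\r\n$Id: f.c 42 2022-02-17 user $\r\n"
edited = b"line2\r\n$Id: f.c 42 2022-02-17 user $\r\n"

print(is_modified(expanded, base_sha1))  # False
print(is_modified(edited, base_sha1))    # True
```

Only the second file would count as locally modified in this model, so only it would cause a base to be downloaded.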

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Thu, Feb 17, 2022 at 9:41 AM Julian Foad <ju...@foad.me.uk> wrote:
>
> Mark Phippard wrote:
> [...]
> >currently there is no pristine. Now I run svn diff and I see the
> >result. The command ends ... there is still no pristine. [...] If, when the
> >command finishes, there are no pristines stored on disk ... then there
> >are no pristines.
>
> The bit you are missing is that, at the end of a diff command, the diffed file is still locally modified, so the pristine is not deleted then; it is kept in the store indefinitely. Only when some subsequent command finds the file unmodified will the pristine be removed. (That holds no matter whether the file became unmodified through an svn operation such as revert or commit, in which case the command that did that would have removed the pristine, or was reverted by the user without involving svn.)
>
> I agree about the insufficiency of current docs and how the commands could be listed in full.

An initial, very rough draft of what might go in the 1.15 release
notes, at least as I see this feature. Obviously, let me know if I'm
totally off here!

New Client Features -> Bare Working Copies

All Subversion working copies require extra storage space in addition
to the size of the checked out files.

By default, the total storage space required is slightly more than
double the size of the checked out files. Subversion uses most of that
extra space to cache each file's BASE revision so that many operations
can work faster and offline.

Starting in 1.15, users can check out a bare working copy to cut the
storage requirement by up to 50%. Instead of caching the BASE revision
of all files all the time, Subversion will only fetch and cache those
of individual files when needed, and will eliminate them when no
longer needed. The space savings come at a tradeoff of requiring a
connection to the repository for more operations as compared to a
normal working copy and may, depending on network speeds and file
sizes, introduce a perceptible delay when a BASE file is downloaded.

This feature is motivated by use cases involving very large versioned
files that change infrequently, where keeping the cached BASE copy
wastes space and provides little or no benefit. This feature may also
be useful in other scenarios, such as where a very fast connection to
the server is available, the repository is local, available storage
space is very limited, etc.

To check out a bare working copy:

$ svn checkout --foo --bar $REPO $WC

The command to check out a normal working copy is unchanged.

The following table lists all Subversion commands and whether they
need to access the repository:

+----------+-------------------------------+
|          | Working Copy Type             |
+----------+---------------+---------------+
| Command  | Normal        | Bare          |
+----------+---------------+---------------+
| add      | Never         | Never         |
| .        |               |               |
| .        |               |               |
| .        |               |               |
| checkout | Always        | Always        |
| .        |               |               |
| .        |               |               |
| .        |               |               |
| diff     | Remote URL    | When modified |
| .        |               |               |
| .        |               |               |
| .        |               |               |
| revert   | Never         | Always        |
| .        |               |               |
| .        |               |               |
| .        |               |               |
+----------+---------------+---------------+

Legend:

* Never: This operation never contacts the repository.

* Remote URL: This operation contacts the repository only if given a
  repository path. It does not contact the repository when operating
  on a local path.

* When Modified: This operation contacts the repository when the path
  in question is locally modified ('svn status' shows 'M' in the 1st
  column) or is provided with a repository URL.

* Always: This operation always contacts the repository.

Additional Details:

When operating on a bare working copy, the Subversion client will
download the BASE revision of a file when it detects that the file is
locally modified and an operation involving that file requires the
BASE revision.

Once downloaded, the BASE revision will remain locally cached until a
further operation either restores the file to an unmodified state or
detects that the file is no longer modified.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Thu, Feb 17, 2022 at 9:41 AM Julian Foad <ju...@foad.me.uk> wrote:
>
> Mark Phippard wrote:
> [...]
> >currently there is no pristine. Now I run svn diff and I see the
> >result. The command ends ... there is still no pristine. [...] If, when the
> >command finishes, there are no pristines stored on disk ... then there
> >are no pristines.
>
> The bit you are missing is that, at the end of a diff command, the diffed file is still locally modified, so the pristine is not deleted then; it is kept in the store indefinitely. Only when some subsequent command finds the file unmodified will the pristine be removed. (That holds no matter whether the file became unmodified through an svn operation such as revert or commit, in which case the command that did that would have removed the pristine, or was reverted by the user without involving svn.)
>
> I agree about the insufficiency of current docs and how the commands could be listed in full.

Thanks Julian, that is clearer. FWIW, this is the wording that confused me:

"The operations also include a final step during which the no longer
required text-bases are removed from disk"

The use of "the operations" made me think it was the same operation.
IOW, I read this as the "diff operation" includes a final step that
does this cleanup. What you seem to be saying is some other future
operation like revert or commit is what would clean it up.

I agree this is a reasonable way for the feature to behave BTW. At
least assuming there are no "surprises" about what operations trigger
the need to fetch the pristine.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

Mark Phippard wrote:
[...]
>currently there is no pristine. Now I run svn diff and I see the
>result. The command ends ... there is still no pristine. [...] If, when the
>command finishes, there are no pristines stored on disk ... then there
>are no pristines.

The bit you are missing is that, at the end of a diff command, the diffed file is still locally modified, so the pristine is not deleted then; it is kept in the store indefinitely. Only when some subsequent command finds the file unmodified will the pristine be removed. (That holds no matter whether the file became unmodified through an svn operation such as revert or commit, in which case the command that did that would have removed the pristine, or was reverted by the user without involving svn.)

I agree about the insufficiency of current docs and how the commands could be listed in full.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Wed, Feb 16, 2022 at 9:59 AM Julian Foad <ju...@foad.me.uk> wrote:
>
> Mark Phippard wrote:
> >> "The core idea is that we start to maintain the following invariant: only the modified files have their pristine text-base files available on the disk."
> >> (https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf/BRANCH-README)
> >
> >That was where I read it! thanks
>
> (The Readme is Evgeny's text AFAIK.)
>
> >So this text confuses me and makes me assume I am not reading it
> >correctly. Suppose I use this new feature to checkout a new WC without
> >any pristines. I make edits to a large binary file using some tool. At
> >this point, SVN does not even know I have done anything so I still
> >have no pristines.
>
> Correct.
>
> >If I run svn status it will show me the file is modified. Are you
> >saying that when I do this, SVN is going to pull down a pristine from
> >the server?
>
> Not for "status". Does the further description from the readme help?:
> """
>   - To get into the appropriate state at the beginning of the operation, we walk through the current text-base info in the db and check if the corresponding working files are modified. The missing text-bases are fetched using the svn_ra layer.

Yes and No. To truly understand this the way it is written requires a
lot of internal knowledge about how SVN works. As someone whose
knowledge is more as a user I am not clear what situations would
require the missing text-bases to be fetched. I think that has been
the heart of my question all along. I certainly understand they are
needed for diff and revert, for example. I would want to know if the
more day to day update/commit cycle needs them. If they do not, then
it sounds good to me.

> The operations also include a final step during which the no longer required text-bases are removed from disk.

This is slightly confusing. I assume it means that the pristines are
not truly stored on disk in that by the time the command finishes they
are gone again? So if I run svn diff, it will fetch the pristines, use
them to complete the operation, and then discard them? FWIW, if this
is how it works I think that is good.

>   - The operations that don't need to access the text-bases (such as "svn ls" or the updated "svn st") do not perform this walk and do not synchronize the text-base state.
> """

So I would just reiterate the earlier comment that it is unclear which
commands need to fetch pristines. It seems like it would be relatively
easy to just spell it out in the docs? It has to be a smallish number
of commands.

>
> > [...] My assumption is that if I have one of these new types of WC's that I
> > will NEVER have any "pristines".
>
> Not correct, for this design.

If the pristines are discarded at the end of a command, as I noted
above, then I think it would be fair to simply describe this feature
as a "bare WC" without pristines. The fact that the pristines
temporarily exist during certain commands is an irrelevant detail.
After all, they also exist during checkout before the final version of
the file is written to disk with line endings and keywords expanded.

>
> >Please enlighten me as to when pristines will be created and stored
> >and why I would want SVN to do that when I asked for no pristines? I
> >think I must be overlooking something obvious.
>
> In my own words now: In this design, pristines are kept locally for modified files (not never). During the new pristines-sync phase at the beginning of any operation that might[1] want the pristine (e.g. diff, but *not* status), if a file is detected to be locally modified and has no pristine locally, then the pristine is fetched ("hydrated") and then kept locally... until, during a sync pass at the end of any such operation, if any file is detected to be not-modified and its pristine is present locally: then it is cleaned up (dehydrated).
>
> Hope that's getting clearer.

This all sounds super internal. It might be good for developer level
explanation but not user level. If I have modified a file and
currently there is no pristine. Now I run svn diff and I see the
result. The command ends ... there is still no pristine. The fact that
something existed during the operation in order to produce the result
is uninteresting to me and it confuses the explanation. If, when the
command finishes, there are no pristines stored on disk ... then there
are no pristines.

Does this make sense?

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

Mark Phippard wrote:
>> "The core idea is that we start to maintain the following invariant: only the modified files have their pristine text-base files available on the disk."
>> (https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf/BRANCH-README)
>
>That was where I read it! thanks

(The Readme is Evgeny's text AFAIK.)

>So this text confuses me and makes me assume I am not reading it
>correctly. Suppose I use this new feature to checkout a new WC without
>any pristines. I make edits to a large binary file using some tool. At
>this point, SVN does not even know I have done anything so I still
>have no pristines.

Correct.

>If I run svn status it will show me the file is modified. Are you
>saying that when I do this, SVN is going to pull down a pristine from
>the server?

Not for "status". Does the further description from the readme help?:
"""
  - To get into the appropriate state at the beginning of the operation, we walk through the current text-base info in the db and check if the corresponding working files are modified. The missing text-bases are fetched using the svn_ra layer. The operations also include a final step during which the no longer required text-bases are removed from disk.
  - The operations that don't need to access the text-bases (such as "svn ls" or the updated "svn st") do not perform this walk and do not synchronize the text-base state. 
"""

> [...] My assumption is that if I have one of these new types of WC's that I
> will NEVER have any "pristines".

Not correct, for this design.

>Please enlighten me as to when pristines will be created and stored
>and why I would want SVN to do that when I asked for no pristines? I
>think I must be overlooking something obvious.

In my own words now: In this design, pristines are kept locally for modified files (not never). During the new pristines-sync phase at the beginning of any operation that might[1] want the pristine (e.g. diff, but *not* status), if a file is detected to be locally modified and has no pristine locally, then the pristine is fetched ("hydrated") and then kept locally... until, during a sync pass at the end of any such operation, if any file is detected to be not-modified and its pristine is present locally: then it is cleaned up (dehydrated).

Hope that's getting clearer.

[1] "might want": false positives exist, as noted in this thread a few weeks ago.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Wed, Feb 16, 2022 at 4:16 AM Julian Foad <ju...@apache.org> wrote:

> "The core idea is that we start to maintain the following invariant: only the modified files have their pristine text-base files available on the disk."
> (https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf/BRANCH-README)

That was where I read it! thanks

So this text confuses me and makes me assume I am not reading it
correctly. Suppose I use this new feature to checkout a new WC without
any pristines. I make edits to a large binary file using some tool. At
this point, SVN does not even know I have done anything so I still
have no pristines.

If I run svn status it will show me the file is modified. Are you
saying that when I do this, SVN is going to pull down a pristine from
the server? That seems very unlikely to me but I cannot otherwise
imagine what your wording in the BRANCH-README would be describing. My
assumption is that if I have one of these new types of WC's that I
will NEVER have any "pristines".

Please enlighten me as to when pristines will be created and stored
and why I would want SVN to do that when I asked for no pristines? I
think I must be overlooking something obvious.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Re. being unsure exactly what the feature behaviour is: we should have a clear description somewhere permanent. Release notes for a start. Help text as well? For now, the BRANCH-README says it this way:

"The core idea is that we start to maintain the following invariant: only the modified files have their pristine text-base files available on the disk."
(https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf/BRANCH-README)

And it gives more detail.

Re. intended UI for enabling it: I do think we need an explicit option to enable the feature by name, not just a WC version number. I haven't yet worked out whether it must also be possible to upgrade to 1.15 format without enabling the feature, and thus need to store the feature-enable flag in the WC somewhere separate from the format version number. For future developments of other wc features, that will be needed; I just haven't finalised yet if it's essential for MVP. Might be, in order to not cause compatibility issues for those future scenarios.

- Julian


Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 15 Feb 2022, Nathan Hartman wrote:
>Possibly bikeshedding a bit, but this seems to return to the idea of
>"turning on" what we are (tentatively) calling "local base"... IMHO it
>would be better if it were reversed to "--remote-base=yes" to convey
>that this is non-default and opt-in. (Or possibly allow both.)

The reason I shy away from the "--remote-base=foo" name is that 
there is *always* a remote base anyway.  Even when one has 
pristines locally, there is also a remote pristine available (and 
indeed the server makes use of it sometimes).  So that name would 
be misleading, and for more knowledgeable users, even confusing.

>Alternatively...
>
>As a command line switch, how about:
>
>"svn checkout --base=local $REPO $WC"
>or
>"svn checkout --base=remote $REPO $WC"

This implies a symmetry between control of local-base presence and 
control of remote-base presence, but there is no such symmetry. 
The only thing this feature can ever control is the presence of 
local bases, so I think it would be a mistake to say anything 
about remote bases when addressing it.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Tue, Feb 15, 2022 at 2:22 PM Karl Fogel <kf...@red-bean.com> wrote:
> As a command-line option for per-WC behavior, it might be
> something like this on checkout:
>
>   --local-base=no
>
> When the option is not provided, the default would be "yes" of
> course (in a sense, it's been defaulting to "yes" for decades :-)
> ).
>
> As a configuration option, it would be something like this:

Possibly bikeshedding a bit, but this seems to return to the idea of
"turning on" what we are (tentatively) calling "local base"... IMHO it
would be better if it were reversed to "--remote-base=yes" to convey
that this is non-default and opt-in. (Or possibly allow both.)

Alternatively...

As a command line switch, how about:

"svn checkout --base=local $REPO $WC"
or
"svn checkout --base=remote $REPO $WC"

The default of "--base=local" would be, as Karl points out, the same
behavior as in past releases, unless the user configures otherwise.

When checking out, it would only be necessary to specify
"--base=local" or "--base=remote" if that differs from the configured
(or implied) default.

A possible future way to change pristine storage in an existing
working copy:

"svn update --set-base=local"
or
"svn update --set-base=remote"

That conceptually mirrors the current "svn checkout --depth=xxx" and
"svn update --set-depth=xxx", (modulo the fact that --depth=xxx has
another meaning for operations beside checkout).

Hopefully this doesn't cause confusion with, e.g., "--accept=base".

No opinion yet regarding configuration option naming.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 15 Feb 2022, Nathan Hartman wrote:
>How about:
>
>Remote BASE
>
>(as opposed to Local BASE).
>
>The idea here being that BASE is a concept with which users 
>should be
>familiar, while pristines are part of Subversion's implementation
>under the hood.

Getting closer, I think!  "base" seems like a good word -- more 
familiar to most users than "pristine" would be, and the meaning 
really is pretty spot-on, since we've been supporting "-rBASE" 
since forever.

As a command-line option for per-WC behavior, it might be 
something like this on checkout:

  --local-base=no

When the option is not provided, the default would be "yes" of 
course (in a sense, it's been defaulting to "yes" for decades :-) 
).

As a configuration option, it would be something like this:

  ### Section for configuring working copies.
  [working-copy]
  no-local-base-FOO = SOME_VALUE

Now, I don't know what FOO and SOME_VALUE are yet -- they will 
vary, because we'll want various behaviors.  Sometimes you'll want 
to say "no local base when checking out from this particular 
repository".  Later, when we support finer-grained local base 
control than just per-WC, we'll want to be able to say "no local 
base for files larger than size X".  And maybe we'll want to say 
"no local base for files that have the following property or 
prop/val combination".  E.g.,

  ### Section for configuring working copies.
  [working-copy]
  no-local-base-repositories = [LIST OF REGEXPS TO MATCH AGAINST]
  no-local-base-properties = BLAH BLAH
  no-local-base-size-threshold = 1GB
  no-local-base-FOO = etc, etc

We don't have to figure out that config UI right now.  I'm just 
trying to figure out the primary user-facing terminology for the 
feature, and maybe "local base" is it.
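
For anyone who wants to picture how such options might be evaluated, here is a rough Python sketch. Everything in it is an assumption made for illustration: the option names simply mirror the placeholders above and are not real Subversion options, a single regexp stands in for the list of regexps, "1GB" is taken to mean 1024^3 bytes, and the example.org URLs are made up.

```python
import configparser
import re

# Hypothetical config text; these are not real Subversion options.
CONFIG_TEXT = r"""
[working-copy]
no-local-base-repositories = ^https://svn\.example\.org/assets/
no-local-base-size-threshold = 1GB
"""

# Assumption: binary units (1GB == 1024**3 bytes).
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text):
    """Turn a string such as '1GB' into a byte count."""
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)?", text.strip())
    if match is None:
        raise ValueError("unrecognised size: " + text)
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)


def wants_local_base(config, repos_url, file_size):
    """Return False if any 'no-local-base-*' rule matches, else True."""
    section = config["working-copy"]
    pattern = section.get("no-local-base-repositories")
    if pattern and re.search(pattern, repos_url):
        return False
    threshold = section.get("no-local-base-size-threshold")
    if threshold and file_size > parse_size(threshold):
        return False
    return True


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
print(wants_local_base(config, "https://svn.example.org/assets/game", 10))
print(wants_local_base(config, "https://svn.example.org/src", 2 * 1024**3))
print(wants_local_base(config, "https://svn.example.org/src", 4096))
```

The default stays "yes, keep a local base": a rule can only switch the local base off, which matches the point that this feature only ever controls the presence of local bases.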

Thoughts?

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Mon, Feb 14, 2022 at 4:54 PM Karl Fogel <kf...@red-bean.com> wrote:

>
> ROTFL :-).  I'll take #2 with a side of onion rings, please.
>
> Those are *descriptions*, for the release notes and other
> documentation, but we will still need a *name* too, to use in the
> command-line flag (or config option, whatever).

How about:

Remote BASE

(as opposed to Local BASE).

The idea here being that BASE is a concept with which users should be
familiar, while pristines are part of Subversion's implementation under the
hood.

Cheers
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 14 Feb 2022, Julian Foad wrote:
>Karl, thanks for bringing a user-focused perspective to the 
>naming. In
>Subversion's UI we will not necessarily expose any name for the 
>feature,
>but we might, e.g. in a configuration file or in help text. In
>describing what's new in 1.15 people will certainly start using 
>some
>short name for the feature and it will be helpful if we pick a 
>memorable
>and user-comprehensible one to start with.

Agree with all your concerns below, but we do have to pick a name 
soon, because there's going to be *some* UI for accessing this 
feature, right? 

That is, even in MVP where the feature is just a non-changeable 
per-WC-at-checkout-time decision about whether pristines are 
cached or not, there has to be some way for the user to specify at 
checkout time what that decision actually *is* for a given WC. 
That could be done via a command-line option, or via something in 
a config file, but whatever it is, it's going to involve having a 
name by which to call the feature.

>In this case the term "binary, large object" is an apt portrayal 
>of the
>main use case. (I'll set aside my distaste for such inelegant
>backronyms.) However, the term is used (I think) primarily in 
>software
>developer circles, which covers a significant portion but perhaps 
>not the
>vast majority of Subversion users. So it might not be the best.
>Similarly, casual Subversion users don't really need to know the 
>terms
>"pristine" and "text base" before they become "power users", 
>although if
>they think about this feature at all then they will obviously 
>become
>aware of the concept, if not those names.

Agreed.

>Johan, I like and agree with your perspective of this pristines
>management as one aspect of caching repository content in 
>general, with
>scope for further variations. I am not sure if that is the best 
>way to
>present it to users at this stage. Perhaps this perspective fits 
>better
>in a "road map" that devs and users interested in further 
>development
>can read.
>
>Ideas:
>
>    - "Option to optimize a checkout for minimal disk space 
>    rather than
>minimal network traffic."
>
>    - "50% off. Unlimited offer. Buy it now. Shrink your 
>    checkouts to
>half the size.*  (Small print: *Compared to our previous 
>checkouts which
>cost double their effective size. Network subscription 
>required.)"

ROTFL :-).  I'll take #2 with a side of onion rings, please.

Those are *descriptions*, for the release notes and other 
documentation, but we will still need a *name* too, to use in the 
command-line flag (or config option, whatever).

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Mon, Feb 14, 2022 at 6:07 AM Julian Foad <ju...@apache.org> wrote:
> Ideas:
>
>     - "Option to optimize a checkout for minimal disk space rather than
> minimal network traffic."
>
>     - "50% off. Unlimited offer. Buy it now. Shrink your checkouts to
> half the size.*  (Small print: *Compared to our previous checkouts which
> cost double their effective size. Network subscription required.)"

You know, I was going to offer some tongue-in-cheek ones myself...

Honey I Shrunk The Working Copy
Half Price Working Copy
Working Copy 4 Less

But I'm still trying to think of a name we could actually use.

Personally I'm not very partial to the BLOB idea. It feels like more
of a SQL database centric concept, and that makes it seem kind of
dated in my mind. We need something that sounds new and cool. :-)

The feature does trade off higher network access for lower storage
consumption, and that is an aspect the user would need to know to make
an educated decision, so we'll probably want to bear that in mind.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Karl, thanks for bringing a user-focused perspective to the naming. In
Subversion's UI we will not necessarily expose any name for the feature,
but we might, e.g. in a configuration file or in help text. In
describing what's new in 1.15 people will certainly start using some
short name for the feature and it will be helpful if we pick a memorable
and user-comprehensible one to start with.

In this case the term "binary, large object" is an apt portrayal of the
main use case. (I'll set aside my distaste for such inelegant
backronyms.) However, the term is used (I think) primarily in software
developer circles, which covers a significant portion but perhaps not the
vast majority of Subversion users. So it might not be the best.
Similarly, casual Subversion users don't really need to know the terms
"pristine" and "text base" before they become "power users", although if
they think about this feature at all then they will obviously become
aware of the concept, if not those names.

Johan, I like and agree with your perspective of this pristines
management as one aspect of caching repository content in general, with
scope for further variations. I am not sure if that is the best way to
present it to users at this stage. Perhaps this perspective fits better
in a "road map" that devs and users interested in further development
can read.

Ideas:

    - "Option to optimize a checkout for minimal disk space rather than
minimal network traffic."

    - "50% off. Unlimited offer. Buy it now. Shrink your checkouts to
half the size.*  (Small print: *Compared to our previous checkouts which
cost double their effective size. Network subscription required.)"

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Johan Corveleyn <jc...@gmail.com>.

On Mon, Feb 14, 2022 at 11:13 AM Ivan Zhakov <ch...@gmail.com> wrote:
> On Mon, 14 Feb 2022 at 01:39, Karl Fogel <kf...@red-bean.com> wrote:
>> On 12 Feb 2022, Mark Phippard wrote:
...
>> In any case, the branch name doesn't matter too much here,
>> especially since it's going to get merged soon.  However, for the
>> user-facing name of the feature, we should pick a name based on
>> the essence of the feature, not on a not-yet-fully-implemented
>> optional enhancement to the feature, discussed further below.
>>
>> On 13 Feb 2022, Julian Foad wrote:
>> >That name came, as far as I am aware, from Evgeny's branch which
>> >implements the latter.
>> >
>> >This may be a case where the public facing name for the feature
>> >ought to differ from the internal development name.
>> >
>> >Any ideas for a good public name?
>> >
>> >Pristines on Subversion's demand?
>> >Dehydrated WC?
>>
>> I kind of like the dehydration/rehydration theme -- it's certainly
>> memorable!  Other possibilities:
>>
>>   - blob-optimized checkouts
>>
>>   - "blobtimized" checkouts (okay, kidding there... :-) )
>>
> I would suggest:
> - optional pristines

As I tried to explain before, I think it makes more sense (also to new
users who have never used pre-1.15) to try to expose the feature as a
knob for the pristine storing (or caching) strategy. Because,
effectively, the pristine store is just a cache, right? All the
information is there on the server, and the client simply duplicates /
caches that information locally to make some operations more
efficient. Up until now, the pristine caching strategy was fixed:
"cache them all, all the time, forever".

So now we're working on a very lazy or minimal type of pristine
caching strategy (or "no caching", if you will -- we might consider it
an implementation detail that a pristine is fetched in the "regular
pristine store" for a moment, and cleaned up after the operation -- it
might just as well have been spooled to a tmp location, or in memory,
or ... during the operation).

To expose this to users, I would take a step back, and open the door
for other types of pristine caching strategy in the future. So I'd
say:

"New feature in 1.15: Configurable Pristine Caching", or "Flexible
Pristine Caching" or "Pristine Caching Options". Where it was
previously a fixed strategy, you now have some choice. In 1.15 we
introduce the "lazy" (or "short-lived", or "minimal") pristine caching
strategy. Apart from that we still have the (default, old) "full" /
"complete" caching strategy. In the future we might introduce
additional (more flexible) strategies, such as those dictated by some
rules, potentially with a repos-side suggestion (like with
svn:auto-props).

Instead of talking about "Pristine Caching Strategy", we could also
talk about the "Pristine Storage Strategy" or "Storing Strategy"
('storing' instead of 'fetching', as the former is the more permanent
effect; fetching might be seen as an implementation detail on what
subversion needs to do when it runs into a non-stored pristine).

-- 
Johan

Re: A two-part vision for Subversion and large binary objects.

Posted by Ivan Zhakov <ch...@gmail.com>.

On Mon, 14 Feb 2022 at 01:39, Karl Fogel <kf...@red-bean.com> wrote:

> On 12 Feb 2022, Mark Phippard wrote:
> >Just to offer a counterpoint Karl, I always assumed the goal of
> >the
> >branch was to have no pristines in the WC and the "on-demand"
> >aspect
> >was referring to an internal SVN detail that it would have to
> >fetch
> >pristines when they were needed to complete a command that I have
> >executed like diff or revert.
> >
> >I know we discussed whether the entire WC, or individual files
> >would
> >not have pristines but I never considered the "on-demand" aspect
> >to be
> >about my ability to decide this. It was about SVN just doing what
> >it
> >needed to when it needed to.
>
> Ah, I see.  That might be where the branch name came from, yeah.
> But the key (necessary) part of the feature is the absence of
> pristines, whereas the restoration of some pristines on demand is
> an optional enhancement (and one we're not even doing in the first
> MVP version).
>
> In fact, selected rehydration is not necessarily even the first
> enhancement we might make after MVP.  There's an argument for
> prioritizing flexible client-side configuration specs first, so
> that all the diffable files get pristines on checkout while all
> the big binary blob files get no pristines.  IOW, if we get the
> checkout right the first time, then selected rehydration becomes
> less important to have; also, there is an easy workaround for it;
> just make a copy of the working file :-).
>
> (I still think selected rehydration would be good to have, of
> course; I'm just pointing out that we haven't really discussed
> where it sits relative to other possible things.)
>
> In any case, the branch name doesn't matter too much here,
> especially since it's going to get merged soon.  However, for the
> user-facing name of the feature, we should pick a name based on
> the essence of the feature, not on a not-yet-fully-implemented
> optional enhancement to the feature, discussed further below.
>
> On 13 Feb 2022, Julian Foad wrote:
> >That name came, as far as I am aware, from Evgeny's branch which
> >implements the latter.
> >
> >This may be a case where the public facing name for the feature
> >ought to differ from the internal development name.
> >
> >Any ideas for a good public name?
> >
> >Pristines on Subversion's demand?
> >Dehydrated WC?
>
> I kind of like the dehydration/rehydration theme -- it's certainly
> memorable!  Other possibilities:
>
>   - blob-optimized checkouts
>
>   - "blobtimized" checkouts (okay, kidding there... :-) )
>
I would suggest:
- optional pristines

Just my two cents.

-- 
Ivan Zhakov

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 12 Feb 2022, Mark Phippard wrote:
>Just to offer a counterpoint Karl, I always assumed the goal of 
>the
>branch was to have no pristines in the WC and the "on-demand" 
>aspect
>was referring to an internal SVN detail that it would have to 
>fetch
>pristines when they were needed to complete a command that I have
>executed like diff or revert.
>
>I know we discussed whether the entire WC, or individual files 
>would
>not have pristines but I never considered the "on-demand" aspect 
>to be
>about my ability to decide this. It was about SVN just doing what 
>it
>needed to when it needed to.

Ah, I see.  That might be where the branch name came from, yeah. 
But the key (necessary) part of the feature is the absence of 
pristines, whereas the restoration of some pristines on demand is 
an optional enhancement (and one we're not even doing in the first 
MVP version).

In fact, selected rehydration is not necessarily even the first 
enhancement we might make after MVP.  There's an argument for 
prioritizing flexible client-side configuration specs first, so 
that all the diffable files get pristines on checkout while all 
the big binary blob files get no pristines.  IOW, if we get the 
checkout right the first time, then selected rehydration becomes 
less important to have; also, there is an easy workaround for it; 
just make a copy of the working file :-).

(I still think selected rehydration would be good to have, of 
course; I'm just pointing out that we haven't really discussed 
where it sits relative to other possible things.)

In any case, the branch name doesn't matter too much here, 
especially since it's going to get merged soon.  However, for the 
user-facing name of the feature, we should pick a name based on 
the essence of the feature, not on a not-yet-fully-implemented 
optional enhancement to the feature, discussed further below.

On 13 Feb 2022, Julian Foad wrote:
>That name came, as far as I am aware, from Evgeny's branch which 
>implements the latter.
>
>This may be a case where the public facing name for the feature 
>ought to differ from the internal development name.
>
>Any ideas for a good public name?
>
>Pristines on Subversion's demand?
>Dehydrated WC? 

I kind of like the dehydration/rehydration theme -- it's certainly 
memorable!  Other possibilities:

  - blob-optimized checkouts

  - "blobtimized" checkouts (okay, kidding there... :-) )

The first one is actually a serious suggestion, though.  It's more 
helpful for users if we frame the feature in terms of what it 
enables than in terms of back-end implementation.  What issue #525 
is about is optimizing for checkouts with lots of Binary Large 
OBjects -- things that don't generally receive mergeable changes 
and that one rarely if ever diffs.  Hence "blob-optimized 
checkouts" as the tag line (and then in the feature description we 
explain the details).

Anyway, that's one idea, but the floor is open...

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Mark Phippard <ma...@gmail.com>.

On Sat, Feb 12, 2022 at 3:15 PM Karl Fogel <kf...@red-bean.com> wrote:

> The name of the "pristines-on-demand" branch implies a certain
> behavior -- namely, that pristines can, via some UI, be fetched on
> demand :-).  But in the MVP we're talking about, pristines in a
> given WC are either all present or all absent, and, at least for
> MVP, that per-WC state is not changeable, right?

Just to offer a counterpoint Karl, I always assumed the goal of the
branch was to have no pristines in the WC and the "on-demand" aspect
was referring to an internal SVN detail that it would have to fetch
pristines when they were needed to complete a command that I have
executed like diff or revert.

I know we discussed whether the entire WC, or individual files would
not have pristines but I never considered the "on-demand" aspect to be
about my ability to decide this. It was about SVN just doing what it
needed to when it needed to.

Mark

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 10 Feb 2022, Julian Foad wrote:
>My current plan:
>
>* multi-wc-format is, I consider, ready for merge to trunk. See 
>thread [1].
>    -> Please review it.
>    - I can post a diff and a summary log message to help 
>    reviewers.
>
>* Make pristines-on-demand behaviour conditional on WC format.
>    - The changes are mostly simple if a bit fiddly. In libsvn_wc 
>    we
>will need a bit of futzing around with two variants of some 
>existing
>wc-db queries, one with and one without the extra 'hydrated' 
>column, to
>work with both DB formats.
>
>* Re-base pristines-on-demand on top of multi-wc-format.
>    - I have this ready in a working copy.
>    - The one significant change is to remove the new bits from 
>    the main
>DB schema statements, so that it will create format 31 not 32, as 
>the
>new way (multi-wc-format) is to always create the baseline 
>(lowest
>supported) format first (which will be 31) and then run the 
>statements
>that upgrade it to any higher requested format (specifically, 
>when 32 is requested).
>
>* Finish the per-WC configuration. See thread [2].
>    -> Please review the plan there.
>
>At that point I would consider the feature a minimum viable 
>product
>(MVP), ready to merge and ready for use.
>
>Please do speak up with any comments.

Agree about MVP being ready by that point.  My only question is 
about a matter of intra-dev communications:

The name of the "pristines-on-demand" branch implies a certain 
behavior -- namely, that pristines can, via some UI, be fetched on 
demand :-).  But in the MVP we're talking about, pristines in a 
given WC are either all present or all absent, and, at least for 
MVP, that per-WC state is not changeable, right?  (That is, MVP 
doesn't include dehydration or rehydration, IIUC.)

Again, just to be clear: I think that's fine and the MVP will be 
very useful, even before any [rd]ehydration feature is available. 
But does the "pristines-on-demand" branch name still accurately 
reflect what the state of the onion will be after that branch is 
merged?

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

I wrote:
> ...does the "pristines-on-demand" branch name still accurately 
> reflect
> what the state of the onion will be after that branch is merged?

Ah, I'll retroactively update my question to now be about the new 
"pristines-on-demand-on-mwf" branch, of course.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

My current plan:

* multi-wc-format is, I consider, ready for merge to trunk. See thread [1].
    -> Please review it.
    - I can post a diff and a summary log message to help reviewers.

* Make pristines-on-demand behaviour conditional on WC format.
    - The changes are mostly simple if a bit fiddly. In libsvn_wc we
will need a bit of futzing around with two variants of some existing
wc-db queries, one with and one without the extra 'hydrated' column, to
work with both DB formats.

* Re-base pristines-on-demand on top of multi-wc-format.
    - I have this ready in a working copy.
    - The one significant change is to remove the new bits from the main
DB schema statements, so that it will create format 31 not 32, as the
new way (multi-wc-format) is to always create the baseline (lowest
supported) format first (which will be 31) and then run the statements
that upgrade it to any higher requested format (specifically, when 32 is requested).

* Finish the per-WC configuration. See thread [2].
    -> Please review the plan there.

At that point I would consider the feature a minimum viable product
(MVP), ready to merge and ready for use.
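
As an illustration of the "create the baseline format, then upgrade" approach described in the re-base step, here is a minimal Python/SQLite sketch. It is not the libsvn_wc code: the one-table schema, the placement of the 'hydrated' column in a table called "pristine", and the use of PRAGMA user_version to record the format are assumptions made for the example.

```python
import sqlite3

BASELINE_FORMAT = 31  # lowest supported format, per the plan above

# Hypothetical, heavily reduced schema; the real wc.db schema is much larger.
BASELINE_SCHEMA = (
    "CREATE TABLE pristine (checksum TEXT PRIMARY KEY, size INTEGER)"
)

# Statements that take format N-1 to format N, keyed by N.
UPGRADES = {
    32: ["ALTER TABLE pristine ADD COLUMN hydrated INTEGER NOT NULL DEFAULT 1"],
}


def create_wc_db(requested_format):
    """Always create the baseline first, then upgrade step by step."""
    if not BASELINE_FORMAT <= requested_format <= max(UPGRADES):
        raise ValueError("unsupported format %d" % requested_format)
    conn = sqlite3.connect(":memory:")
    conn.execute(BASELINE_SCHEMA)
    for fmt in range(BASELINE_FORMAT + 1, requested_format + 1):
        for statement in UPGRADES[fmt]:
            conn.execute(statement)
    # PRAGMA values cannot be bound as parameters; the value is an int here.
    conn.execute("PRAGMA user_version = %d" % requested_format)
    return conn


def columns(conn):
    """List the column names of the (hypothetical) pristine table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(pristine)")]


print(columns(create_wc_db(31)))
print(columns(create_wc_db(32)))
```

With this shape there is only one copy of the baseline schema to maintain, and the format 32 additions live solely in the upgrade statements.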

Please do speak up with any comments.


[1] "Multi-WC-format branch: preparing for merge to trunk"
[2] "[PATCH] Sketch of per-user/per-wc config for pristines-mode"

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

On 2022-01-28 Evgeny Kotkov wrote:
> Julian Foad writes:
>> We could swap the scanning logic around to do the (quick) check for
>> missing pristines before deciding whether a (slower) file "stat" is
>> necessary. [...]
> 
> I might be missing something, but I don't yet see how we could save a stat().

I have now responded to this in the thread "[PATCH] Sketch of
per-user/per-wc config for pristines-mode", in a paragraph beginning "3.
Another alternative ...".

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Evgeny Kotkov <ev...@visualsvn.com>.

Julian Foad <ju...@apache.org> writes:

> We could swap the scanning logic around to do the (quick) check for
> missing pristines before deciding whether a (slower) file "stat" is
> necessary. Positives: eliminates the redundant "stat" overhead which may
> be significant in working trees containing many files. Negatives: some
> re-work needed in the current implementation.
>
> Of these, the last one currently looks viable and useful.
>
> Does that one look promising to you?

I might be missing something, but I don't yet see how we could save a stat().

Currently, a pristine is hydrated if and only if the corresponding working
file is modified.  Let's say we check if a pristine is hydrated beforehand.
If we find out that pristine is dehydrated, we have to stat(), because if the
file is modified, then we need to hydrate.  If we find out that pristine is
hydrated, we still have to stat(), because if the file is no longer modified,
then we need to dehydrate.
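
To spell the argument out, here is a tiny Python sketch of the decision a sync has to make per file under the invariant "hydrated if and only if modified". It is not Subversion code; the function and parameter names are made up for illustration.

```python
def sync_action(pristine_hydrated, file_modified):
    """Return what a sync must do to restore the invariant
    'pristine is hydrated if and only if the working file is modified'."""
    if file_modified and not pristine_hydrated:
        return "hydrate"
    if pristine_hydrated and not file_modified:
        return "dehydrate"
    return "nothing"


# Whichever answer the (cheap) hydrated check gives, two different actions
# remain possible, and choosing between them needs file_modified, i.e. a stat().
for hydrated in (False, True):
    actions = {sync_action(hydrated, modified) for modified in (False, True)}
    print(hydrated, sorted(actions))
```

Neither value of the hydrated flag narrows the outcome to a single action, so knowing it in advance does not remove the need to look at the working file.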

Thanks,
Evgeny Kotkov

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Fri, Jan 28, 2022 at 12:30:01 +0000:
> We could fetch pristines in a background thread, making the "foreground"
> operation thread wait, just in time, for each pristine before accessing
> it. Positives: the end result is efficient. Negatives: we don't have
> precedent for threaded operations,

We have the svn_task_* API (subversion/include/private/svn_task.h:38),
although nothing seems to use it yet, on any branch.

We have subversion/include/private/svn_thread_cond.h to support it.

We use threads in the test suite:
subversion/tests/svn_test_main.c:do_tests_concurrently()

> and they can be tricky, so unknown and potentially large effort to
> complete it.

Even with the API building blocks in place, there's still effort
involved, of course.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Thanks for your replies, Evgeny. Replying to the "status walk" part.

(TL;DR: Could we optimise by doing the db scan before the stat?)

I think we want to ensure, as far as possible,

  - no significant performance degradation if user does not opt in to
the new feature;
  - overhead when enabled should not be disproportionate to the operation.

We can achieve the former simply by skipping the scanning/syncing steps
entirely when the user has not chosen pristines-on-demand mode. That is fine.

About the overhead, you suggest the overhead does not seem too high, at
least for now, and maybe you are right. But it can be subjective: some
people have considered the status walk too costly in the past and have
dedicated a lot of effort to reducing it. I have some ideas about
reducing this overhead anyway.

Some ways we could potentially optimise when it's enabled are:

  - Fetch pristines in a background thread;
  - Further limit the scan (by depth etc.);
  - Scan for missing pristines (quick db op) before statting files for mods.

We could fetch pristines in a background thread, making the "foreground"
operation thread wait, just in time, for each pristine before accessing
it. Positives: the end result is efficient. Negatives: we don't have
precedent for threaded operations, and they can be tricky, so unknown
and potentially large effort to complete it.
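
A rough picture of the background-fetch idea, sketched in Python rather than against Subversion's C APIs. The fetch_pristine() function is a stand-in that only sleeps; the names and the worker count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_pristine(checksum):
    """Stand-in for a network fetch; a real one would contact the server."""
    time.sleep(0.01)
    return "contents of " + checksum


def process(needed):
    """Start every fetch in the background, then consume them in order."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = {c: pool.submit(fetch_pristine, c) for c in needed}
        for checksum in needed:
            # The foreground thread blocks here only if this particular
            # pristine has not arrived yet ("just in time").
            results.append(pending[checksum].result())
    return results


print(process(["aaa", "bbb", "ccc"]))
```

The foreground loop never waits for pristines it has not reached yet, which is where the efficiency would come from; the hard part, not shown here, is error handling and cancellation across threads.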

We could further limit the scan (by depth etc.). Positives: easy to
implement some steps in this direction. Negatives: only ever gets
us closer, but never all the way, towards fetching only what is really
needed; and risk of introducing buggy cases where it fetches too little.

We could swap the scanning logic around to do the (quick) check for
missing pristines before deciding whether a (slower) file "stat" is
necessary. Positives: eliminates the redundant "stat" overhead which may
be significant in working trees containing many files. Negatives: some
re-work needed in the current implementation.

Of these, the last one currently looks viable and useful.

Does that one look promising to you?

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov wrote on Fri, Jan 28, 2022 at 01:58:43 +0300:
> Also, I tend to think that DRY doesn't really apply here, because a status
> walk and a textbase sync are essentially different operations that just
> happen to have something in common internally.  For example, a textbase
> sync doesn't have to follow the tree structure and can be implemented
> with an arbitrarily ordered walk over NODES.

If we iterate NODES in an arbitrary order, we'll lose the benefit of
cache locality of the OS filesystem's page cache.  To avoid that, we can
do a depth-first walk.  Is there a reason not to?

% sqlite3 .svn/wc.db  "SELECT local_relpath FROM nodes WHERE checksum IS NOT NULL ORDER BY checksum;" | head 
subversion/libsvn_diff/diff4.c
subversion/bindings/javahl/src/org/tigris/subversion/javahl/CommitMessage.java
notes/subversion-diagram.graffle
tools/dist/release.py
tools/hook-scripts/validate-extensions.py
subversion/mod_dav_svn/reports/log.c
tools/dev/iz/ff2csv.py
subversion/libsvn_fs_x/util.c
subversion/libsvn_ra/util.c
subversion/bindings/javahl/native/ExternalItem.hpp
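
To put a number on the scattering shown above, here is a small Python sketch using an in-memory SQLite table. The two-column "nodes" table is a hypothetical stand-in for the real NODES table, the rows are sample data, and ordering by local_relpath is used only as a cheap approximation of a depth-first walk.

```python
import sqlite3

# Hypothetical, heavily simplified stand-in for the NODES table in wc.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (local_relpath TEXT, checksum TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        ("subversion/libsvn_diff/diff4.c", "a1"),
        ("tools/dist/release.py", "a2"),
        ("subversion/libsvn_ra/util.c", "a3"),
        ("subversion/libsvn_diff/util.c", "a4"),
        ("tools/dist/backport.pl", "a5"),
        ("subversion/libsvn_ra/compat.c", "a6"),
    ],
)


def walk(order_by):
    """Return paths in the given order; order_by is a fixed column name."""
    sql = ("SELECT local_relpath FROM nodes "
           "WHERE checksum IS NOT NULL ORDER BY " + order_by)
    return [row[0] for row in conn.execute(sql)]


def directory_switches(paths):
    """Count how often consecutive paths live in different directories."""
    dirs = [p.rsplit("/", 1)[0] for p in paths]
    return sum(1 for a, b in zip(dirs, dirs[1:]) if a != b)


by_checksum = walk("checksum")
by_path = walk("local_relpath")
print(directory_switches(by_checksum), directory_switches(by_path))
```

With this sample data the checksum order changes directory on every step, while the path order changes directory only when it has finished one.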

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Evgeny Kotkov <ev...@visualsvn.com>.

Julian Foad <ju...@apache.org> writes:

> Scanning with 'stat'
>
> I'm concerned about the implementation scanning the whole subtree,
> calling 'stat' on every file to determine whether the file is "changed"
> (locally modified). This is done in svn_wc__textbase_sync() with its
> textbase_walk_cb().
>
> It does this scan on every sync, which is twice on every syncing
> operation such as diff.
>
> Don't we already have an optimised scan for local modifications
> implemented in the "status" code? Could we re-use this?

In a few of my experiments, performance of textbase_sync() was more or
less comparable to a status walk.  So maybe it's not actually worthwhile
spending time on improving this part, at least for now.

Also, I tend to think that DRY doesn't really apply here, because a status
walk and a textbase sync are essentially different operations that just
happen to have something in common internally.  For example, a textbase
sync doesn't have to follow the tree structure and can be implemented
with an arbitrarily ordered walk over NODES.

> Premature Hydrating
>
> The present implementation "hydrates" (fetches missing pristines) every
> file within the whole subtree the operation targets. This is done by
> every major client operation calling svn_client__textbase_sync() before
> and afterwards.
>
> That is pessimistic: the operation may not actually touch all these
> files if limited in any way such as by
>
>   - depth filtering
>   - other filtering (changelist, properties-only, ...)
>   - terminating early (e.g. output piped to 'head')
>
> That introduces all the fetching overhead for the given subtree as a
> latency before the operation shows its results, which for something
> small at the root of the tree such as "svn diff --depth=empty
> --properties-only ./" may make a significant usability impact.
>
> Presumably we could add the depth and some other kinds of filtering to
> the tree walk. But that will always leave terminating early, and
> possibly other cases, sub-optimal.
>
> I would prefer a solution that defers the hydrating until closer to the
> moment of demand.

I think that fetching the pristine contents at the moment of demand is a
particularly problematic concept to pursue, because it implies that there is
a network request that can now happen at an unpredictable moment of time.
So any operation that may access the pristine contents has to be ready for
a network fetch.  Compared to that, fetching the required pristines before
the operation does not impose that kind of requirement on the existing code.
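
A minimal sketch of that ordering, in Python rather than the real C code. Every name here (WorkingCopy, sync_textbase, run_operation) is a hypothetical stand-in, not a Subversion API. The point it illustrates is only that the network fetch is confined to the sync step, so the operation itself works from local pristines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FileState:
    modified: bool          # working file differs from its pristine
    hydrated: bool = False  # pristine text is present locally


@dataclass
class WorkingCopy:
    files: Dict[str, FileState]
    log: List[str] = field(default_factory=list)  # records events in order


def sync_textbase(wc: WorkingCopy) -> None:
    """Hypothetical sync: keep a pristine iff its working file is modified."""
    for path, st in sorted(wc.files.items()):
        if st.modified and not st.hydrated:
            wc.log.append(f"fetch {path}")  # the only place a fetch can happen
            st.hydrated = True
        elif not st.modified and st.hydrated:
            wc.log.append(f"discard {path}")
            st.hydrated = False


def run_operation(wc: WorkingCopy,
                  operation: Callable[[WorkingCopy], None]) -> None:
    """Sync, run the operation on local pristines only, then sync again."""
    sync_textbase(wc)
    operation(wc)
    sync_textbase(wc)


def diff(wc: WorkingCopy) -> None:
    """Stand-in for an operation that reads pristines of modified files."""
    for path, st in sorted(wc.files.items()):
        if st.modified:
            assert st.hydrated, "the operation never has to fetch by itself"
            wc.log.append(f"diff {path}")


wc = WorkingCopy({"a.bin": FileState(modified=True),
                  "b.txt": FileState(modified=False)})
run_operation(wc, diff)
print(wc.log)
```

The cost of this design is the one raised in the quoted text: the first sync fetches for the whole target subtree, whether or not the operation ends up reading every pristine.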

Thanks,
Evgeny Kotkov

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Vincent Lefevre wrote:
> Do you mean that "svn diff" at the root will fetch everything
> even if no files are modified?

No, only the pristines for files that are modified.

Re: A two-part vision for Subversion and large binary objects.

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2022-01-21 11:15:04 +0000, Julian Foad wrote:
> Premature Hydrating
> 
> The present implementation "hydrates" (fetches missing pristines) every
> file within the whole subtree the operation targets. This is done by
> every major client operation calling svn_client__textbase_sync() before
> and afterwards.
> 
> That is pessimistic: the operation may not actually touch all these
> files if limited in any way such as by
> 
>   - depth filtering
>   - other filtering (changelist, properties-only, ...)
>   - terminating early (e.g. output piped to 'head')
> 
> That introduces all the fetching overhead for the given subtree as a
> latency before the operation shows its results, which for something
> small at the root of the tree such as "svn diff --depth=empty
> --properties-only ./" may make a significant usability impact.

Do you mean that "svn diff" at the root will fetch everything
even if no files are modified?

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Mon, Jan 31, 2022 at 6:41 AM Johan Corveleyn <jc...@gmail.com> wrote:
>
> Replying to a few different points in this thread.
>
> On Jan 27, Julian Foad wrote:
> > The user can choose one mode, per WC, from a list of options that may include:
> >
> >   - off: as in previous versions, no checking, just assume all pristines
> > are present
> >   - pristines-on-demand: fetch "wanted" pristines; discard others
> >   - fetch-only: fetch any pristine that's absent; do not discard
>
> I think, whatever the name of the property here, "off" is confusing
> for wanting all pristines (the back-compat / old / default (?)
> behaviour).
>
> To me it sounds like I am setting the feature "do not fetch all
> pristines" to off, so "please fetch all pristines" with a double
> negation.

+1 to avoiding the double negative.

I did previously speak in terms of the (pristines-on-demand) feature
being turned on or off, but I was thinking from the perspective of a
proposed new option which shouldn't take effect unless the user opts
in. Imagining it from the perspective of a future user after the
feature exists, this is a double negative and is indeed confusing.

More inline responses below...

> Hmm. I think the pristine-fetching strategy that is chosen for a
> particular working copy should be a property of that working copy. That's
> because it has a "persistent" impact on that working copy. Changing
> that strategy (if we would support that) severely impacts the disk
> layout of that particular working copy. It's not just a runtime thing,
> like using "exclusive sqlite locks" or some such (leaves no trace for
> the next user).
>
> If it would be a runtime setting, and Alice and Bob would both work on
> the same working copy, and the former has "pristine-fetching=full" and
> the latter "pristine-fetching=lazy" (or some detailed strategy with
> patterns, whatever), the working copy would be changed severely every
> time one or the other touches it.

+1. Since pristines are part of a particular working copy, the
variables that indicate whether they are present and how/when they are
fetched should be part of the working copy as well.

> So I think the chosen pristine-fetching strategy for a working copy
> should be stored in the WC itself, probably in wc.db.
>
> However, we would still need related runtime config options. But I see
> them as "defaults" for when creating a new working copy. Perhaps these
> belong in a [working-copy defaults] or [working-copy creation]
> section, as opposed to the [working-copy] section which is more about
> runtime behaviour.

That would be convenient; if a user wants pristines-on-demand all or
most of the time, this would save having to specify that for every
checkout. Is it mandatory for the first iteration? I don't know, but I
do think it would be very nice to have.

What *is* mandatory in my mind is an option for 'svn checkout'. The
user should not be locked in to a configured default because they may
(and perhaps are even likely to) have a mix of different repos they
check out, perhaps from faster or slower servers, or perhaps some repos
with seldom-changing files and others with frequently changing ones, etc.
The user knows which is the more common case and should configure that
as the default, then override it at checkout time for the less common
case.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Johan Corveleyn <jc...@gmail.com>.

Replying to a few different points in this thread.

On Jan 27, Julian Foad wrote:
> The user can choose one mode, per WC, from a list of options that may include:
>
>   - off: as in previous versions, no checking, just assume all pristines
> are present
>   - pristines-on-demand: fetch "wanted" pristines; discard others
>   - fetch-only: fetch any pristine that's absent; do not discard

I think, whatever the name of the property here, "off" is confusing
for wanting all pristines (the back-compat / old / default (?)
behaviour).

To me it sounds like I am setting the feature "do not fetch all
pristines" to off, so "please fetch all pristines" with a double
negation.

Maybe we should go for something like:
pristine-fetching = full (or "eager", or "all", i.e. default) | lazy
(or "on-demand")

Perhaps with a third option "lazy-keep" (like your "fetch-only"),
indicating on-demand, but not immediately cleaning it after use (don't
know if this would be useful -- could be added later of course). Or
"lazy-transient" for the "lazy with immediate cleaning after use" as
opposed to "lazy" (which keeps fetched pristines once fetched).

On Sat, Jan 29, 2022 at 9:22 AM Julian Foad <ju...@apache.org> wrote:
>
> Vincent Lefevre wrote:
> >> [...] Specifying a pattern to match the WC path [or] per repository [...]
> >
> >But what if a WC can be accessed from different machines [...]?
>
> Then:
> - The config option should be designed never to assume or depend on the pristine store being in a particular state (such as fully populated).
> - The user might want different behaviour on different machines, or the same on all.
> - The patch I posted yesterday in a separate thread allows the user to set the config option in the user config or per-wc config.
> - I noticed we already have some other config options in the '[working-copy]' config section. We probably should allow the user to set those per-wc too.
> - Julian

Hmm. I think the pristine-fetching strategy that is chosen for a
particular working copy should be a property of that working copy. That's
because it has a "persistent" impact on that working copy. Changing
that strategy (if we would support that) severely impacts the disk
layout of that particular working copy. It's not just a runtime thing,
like using "exclusive sqlite locks" or some such (leaves no trace for
the next user).

If it would be a runtime setting, and Alice and Bob would both work on
the same working copy, and the former has "pristine-fetching=full" and
the latter "pristine-fetching=lazy" (or some detailed strategy with
patterns, whatever), the working copy would be changed severely every
time one or the other touches it.

So I think the chosen pristine-fetching strategy for a working copy
should be stored in the WC itself, probably in wc.db.

However, we would still need related runtime config options. But I see
them as "defaults" for when creating a new working copy. Perhaps these
belong in a [working-copy defaults] or [working-copy creation]
section, as opposed to the [working-copy] section which is more about
runtime behaviour.

Basically:
  - The chosen pristine-fetching strategy should be a persistent
property of the WC, to be chosen at creation time.
  - Defaults for this should be part of our runtime-config area (and
perhaps also options for 'svn checkout').
  - We might introduce ways to change the setting of a given WC (but
it's not a must have for the first iteration, I guess)
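
As a rough sketch of that precedence (Python, purely illustrative): the option name "pristine-fetching" and the values "full" and "lazy" are just the names floated in this thread, and both functions are hypothetical, not existing svn code.

```python
from typing import Mapping, Optional

VALID_MODES = ("full", "lazy")  # names proposed in this thread, not a real option


def mode_for_new_wc(checkout_option: Optional[str],
                    runtime_defaults: Mapping[str, str]) -> str:
    """Mode to record in a new WC: checkout option, else config default, else 'full'."""
    mode = checkout_option or runtime_defaults.get("pristine-fetching", "full")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown pristine-fetching mode: {mode!r}")
    return mode


def effective_mode(stored_in_wc: Optional[str]) -> str:
    """After creation only the value stored in the WC counts.

    A WC without the property (an older one) behaves as 'full'.
    """
    return stored_in_wc or "full"


# Alice defaults to "full", Bob to "lazy"; the WC was created by Bob.
stored = mode_for_new_wc(None, {"pristine-fetching": "lazy"})
print(effective_mode(stored))  # the same answer for Alice and for Bob
```

Because effective_mode() never consults the runtime config, two users sharing one working copy cannot flip its on-disk layout back and forth.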

On Fri, Jan 28, 2022 at 6:11 PM Evgeny Kotkov
<ev...@visualsvn.com> wrote:
>
> Julian Foad <ju...@apache.org> writes:
>
> > We could swap the scanning logic around to do the (quick) check for
> > missing pristines before deciding whether a (slower) file "stat" is
> > necessary. Positives: eliminates the redundant "stat" overhead which may
> > be significant in working trees containing many files. Negatives: some
> > re-work needed in the current implementation.
> >
> > Of these, the last one currently looks viable and useful.
> >
> > Does that one look promising to you?
>
> I might be missing something, but I don't yet see how we could save a stat().
>
> Currently, a pristine is hydrated if and only if the corresponding working
> file is modified.  Let's say we check if a pristine is hydrated beforehand.
> If we find out that pristine is dehydrated, we have to stat(), because if the
> file is modified, then we need to hydrate.  If we find out that pristine is
> hydrated, we still have to stat(), because if the file is no longer modified,
> then we need to dehydrate.
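
The rule in the quoted text can be written as a small decision function. This is an illustrative Python sketch with hypothetical names, not the branch's actual code.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    HYDRATE = "hydrate"      # fetch the missing pristine
    DEHYDRATE = "dehydrate"  # discard the pristine that is no longer needed


def sync_action(pristine_present: bool, file_modified: bool) -> Action:
    """Invariant from the thread: a pristine is hydrated iff its file is modified."""
    if file_modified and not pristine_present:
        return Action.HYDRATE
    if pristine_present and not file_modified:
        return Action.DEHYDRATE
    return Action.NONE


# For either value of pristine_present the outcome still depends on
# file_modified (the result of the stat()), so checking the pristine
# first does not let us skip the stat().
for present in (False, True):
    outcomes = {sync_action(present, modified) for modified in (False, True)}
    print(present, sorted(a.value for a in outcomes))
```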

What seems important to me is that the code here can assume (like the
old code) that the pristine is present whenever a WC is set to
pristine-fetching=full, or whenever SVN 1.15 is working with an older
working copy that lacks the "pristine-fetching" property (say, if
upgrading the wc format turns out not to be needed). No need for an
extra stat: just assume it is there and, if it is not, error out like
today. (Incidentally, this is a very annoying error to run into if a
pristine is accidentally deleted for some reason, because the user
cannot recover from it and the WC is hosed -- but that is another issue.)

So we introduce no extra overhead for "fully fetched / old style" WCs.

Of course, keeping the extra overhead as low as possible for
"pristine-fetching=lazy" WCs is an important consideration too (but
I'm afraid I can't contribute anything useful there :-), haven't dug
in deep enough).

-- 
Johan

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Vincent Lefevre wrote:
>> [...] Specifying a pattern to match the WC path [or] per repository [...]
>
>But what if a WC can be accessed from different machines [...]?

Then:
- The config option should be designed never to assume or depend on the pristine store being in a particular state (such as fully populated).
- The user might want different behaviour on different machines, or the same on all.
- The patch I posted yesterday in a separate thread allows the user to set the config option in the user config or per-wc config.
- I noticed we already have some other config options in the '[working-copy]' config section. We probably should allow the user to set those per-wc too.
- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2022-01-27 17:21:42 +0000, Julian Foad wrote:
> This setting doesn't have to be persistent in the WC. It could be
> configured in client run-time config instead (e.g.
> ~/.subversion/config), as we previously mentioned.
> 
> If it's stored in the WC then we need to create some new UI to control
> the setting. I am not sure we want to do so just now. It does seem, if
> we were designing svn from scratch, such a setting would ideally be
> remembered in the WC and there would be UI to control it, analogous to
> "git config --system|--global|--local", but we are not there.
> 
> When we were thinking the setting would be of the form "on for all files
> larger than X" then the runtime config seemed more appropriate, as that
> form might be expected to apply to many WCs, possibly adding conditions
> such as "and path to WC matches Y" or "repository matches Z". Specifying
> the WC path is ugly as WCs can be moved and we haven't ever exposed any
> other identifier for a WC. Specifying a pattern to match the WC path is
> better. Specifying it per repository is very logical because the
> behavior is so dependent on the repo connection.

But what if a WC can be accessed from different machines (e.g. via
NFS or SSHFS), so potentially with different ~/.subversion/config
files? And what if a WC is stored on a USB drive/disk, which can
move to various machines?

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

Julian Foad wrote:
> I'm writing an initial patch to let the user control pristines-on-demand
> on and off per WC.

This setting doesn't have to be persistent in the WC. It could be
configured in client run-time config instead (e.g.
~/.subversion/config), as we previously mentioned.

If it's stored in the WC then we need to create some new UI to control
the setting. I am not sure we want to do so just now. It does seem, if
we were designing svn from scratch, such a setting would ideally be
remembered in the WC and there would be UI to control it, analogous to
"git config --system|--global|--local", but we are not there.

When we were thinking the setting would be of the form "on for all files
larger than X" then the runtime config seemed more appropriate, as that
form might be expected to apply to many WCs, possibly adding conditions
such as "and path to WC matches Y" or "repository matches Z". Specifying
the WC path is ugly as WCs can be moved and we haven't ever exposed any
other identifier for a WC. Specifying a pattern to match the WC path is
better. Specifying it per repository is very logical because the
behavior is so dependent on the repo connection.

Any good ideas anyone?

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Evgeny Kotkov <ev...@visualsvn.com>.

Julian Foad <ju...@apache.org> writes:

> SEMANTICS
>
> The user can choose one mode, per WC, from a list of options that may
> include:
>
>   - off: as in previous versions, no checking, just assume all pristines
> are present
>   - pristines-on-demand: fetch "wanted" pristines; discard others
>   - fetch-only: fetch any pristine that's absent; do not discard

My two cents on this would be that it might be easier to start with an on/off
property that gets stored when a working copy is created and doesn't change
afterwards, at least for now.

> Help please: Where should we properly store this setting in the WC?
>
> - in '.svn/entries' or '.svn/format'?
>   (Both currently contain a single line saying "12". We could add an
> extra line, or in general N extra lines each with one setting, for example.)
> - in a new file such as '.svn/pristines-on-demand'?
> - in the wc.db somewhere?

Thinking out loud, this sounds like a property associated with a specific
wc_id in the database.  I would say that this pretty much rules out options
of storing it outside the wc.db.
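
A minimal sketch of what "a property associated with a wc_id" could look like, using Python's sqlite3 module. The schema below is hypothetical and deliberately simplified; it is not the real wc.db layout, and the table and column names are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: NOT the real wc.db layout, just a per-wc_id setting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wcroot (id INTEGER PRIMARY KEY, local_abspath TEXT)")
conn.execute(
    "CREATE TABLE wc_settings ("
    " wc_id INTEGER PRIMARY KEY REFERENCES wcroot (id),"
    " pristines_on_demand INTEGER NOT NULL DEFAULT 0)"
)


def create_wc(abspath: str, on_demand: bool) -> int:
    """Record the on/off setting once, when the working copy is created."""
    cur = conn.execute("INSERT INTO wcroot (local_abspath) VALUES (?)", (abspath,))
    wc_id = cur.lastrowid
    conn.execute("INSERT INTO wc_settings VALUES (?, ?)", (wc_id, int(on_demand)))
    return wc_id


def is_pristines_on_demand(wc_id: int) -> bool:
    """Read the setting; a WC with no row behaves as 'off'."""
    row = conn.execute(
        "SELECT pristines_on_demand FROM wc_settings WHERE wc_id = ?", (wc_id,)
    ).fetchone()
    return bool(row[0]) if row else False


lazy_wc = create_wc("/home/alice/big-assets", True)
full_wc = create_wc("/home/alice/source", False)
print(is_pristines_on_demand(lazy_wc), is_pristines_on_demand(full_wc))
```

Keeping the value next to the wc_id means it travels with the working copy, whoever opens it and from whichever machine.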

Thanks,
Evgeny Kotkov

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

I'm writing an initial patch to let the user control pristines-on-demand
on and off per WC.

Here I will assume:

  - The user needs a way to populate all pristines throughout the whole
WC when they want to, for example before they go offline.

At first I thought we would have an option value that represents the
state where all pristines are definitely present, as in older formats.
But we want a setting that is under direct control of the user, not a
state that Subversion reports from an API (although we might want that
as well). The user may change the setting at any time, when some
pristines may be absent, so there is no "all pristines are present right
now" mode setting.


SEMANTICS

The user can choose one mode, per WC, from a list of options that may include:

  - off: as in previous versions, no checking, just assume all pristines
are present
  - pristines-on-demand: fetch "wanted" pristines; discard others
  - fetch-only: fetch any pristine that's absent; do not discard

If the user wishes to ensure all pristines in the WC are present, they
can set the "fetch-only" mode and then run some svn command that fetches
all missing pristines.

If the user wishes to change a WC from having all pristines present to
pristines-on-demand, keeping pristines only for files that are currently
modified and discarding the rest, they can set the "pristines-on-demand"
mode and then run some svn command that discards all "unwanted" pristines.

In both of those cases, it doesn't particularly matter which command the
user has to run, just so long as we ensure there is one that we can recommend.

Additional options are possible such as:

  - off-line: use existing pristines; do not fetch or discard

The "off" and "off-line" modes imply basically the same behaviour; where
they differ is in the expectation that all pristines are present when we
choose "off" mode. I am not yet sure if we will want to keep such a
distinction in the end.


PERFORMANCE

The current 'pristines-on-demand' branch implementation does two scans of
the given WC subtree, one before and one after certain operations, as I
mentioned before.

  - "pristines-on-demand" mode: these scans are needed.

  - "off" and "off-line" modes: these can be skipped entirely.

  - "fetch-only" mode: the scan after the operation can be skipped,
while the scan before it will still be performed, even when all the
pristines (at least in the subtree) are already present.
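
To summarise the modes and their scan costs in one place, here is an illustrative Python table. The mode names are the ones floated above; none is an existing svn setting, and ModePolicy is a hypothetical structure.

```python
from typing import NamedTuple


class ModePolicy(NamedTuple):
    fetch: bool        # may fetch absent pristines
    discard: bool      # may discard "unwanted" pristines
    scan_before: bool  # scan the subtree before the operation
    scan_after: bool   # scan the subtree after the operation


# Mode names as proposed in this message; none is an existing svn setting.
POLICIES = {
    "off":                 ModePolicy(False, False, False, False),
    "off-line":            ModePolicy(False, False, False, False),
    "fetch-only":          ModePolicy(True,  False, True,  False),
    "pristines-on-demand": ModePolicy(True,  True,  True,  True),
}


def scans_per_operation(mode: str) -> int:
    """Number of subtree scans a syncing operation pays for in this mode."""
    policy = POLICIES[mode]
    return int(policy.scan_before) + int(policy.scan_after)


for name in POLICIES:
    print(f"{name}: {scans_per_operation(name)} scan(s)")
```

Seen this way, "off" and "off-line" are identical in behaviour and differ only in what the user expects to be present.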

Are we going to need to optimise until the cost is negligible at least
when pristines are all present, so that the user would never have need
to turn the feature "off" completely to match current performance?


IMPLEMENTATION

My patch initially uses a file '.svn/pristines-on-demand' as a
place-holder for wherever we might choose to store the setting properly
(in wc.db for example).

Help please: Where should we properly store this setting in the WC?

- in '.svn/entries' or '.svn/format'?
  (Both currently contain a single line saying "12". We could add an
extra line, or in general N extra lines each with one setting, for example.)
- in a new file such as '.svn/pristines-on-demand'?
- in the wc.db somewhere?

Do we have any precedent of user controlled settings in the WC? I can't
find any.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

In the messages I'm now replying to, basically we were debating details
of some potential use cases, in the context of how far a per-WC control
might or might not be adequate over the range of possible cases.

I started drafting a point-by-point reply but I think we may better use
our time agreeing that some cases would benefit from finer grained
control but we're not yet in a position to quantify it.

Just some selected inline responses:

Daniel Shahaf wrote:
> [...]
> Haven't you just moved your goalposts?  I quote:

I don't think so. I could have been unclear. The phrases "not 'huge'"
and "minority of the pristine space" do not appear to conflict. The
double negatives can be confusing.

[...]

>> I don't dispute that some cases exist where it would be nice to have
>> per-file control. [...]
> 
> Hang on.  Why do you assume that if someone has big files, then they're
> necessarily all in one directory and [...]

I don't, and can't see where I implied that.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 27 Jan 2022, Daniel Shahaf wrote:
>Hang on.  Why do you assume that if someone has big files, then they're
>necessarily all in one directory and all the accompanying texty
>(or otherwise diffable) files are all in another directory?  Sure,
>that's exactly kfogel's use-case (described upthread), but it's not the
>only way to structure a repository.

FYI, that's not our company's use case.

In fact, we have large non-diffable files spread all over in many 
directories, and we have small-texty-diffable files also spread 
out over many directories, and both kinds of file often co-exist 
within the same directory.

Just correcting the record -- I don't think this one particular use case
is necessarily definitive for the feature or anything.  I just wanted to
make sure that you have accurate information.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Tue, Jan 25, 2022 at 21:43:44 +0000:
> Daniel Shahaf wrote:
> > Julian Foad wrote on Thu, Jan 20, 2022 at 21:03:02 +0000:
> >> The only case in which a simple per-WC setting might be unsatisfactory
> >> is the following combination:
> >  
> > Why would it be the only case?
> 
> I assert that per-WC control suffices if any of the conditions I listed
> is false.

I understood the form of your argument; I just didn't understand why the
argument was correct.  Saying that the case you've outlined is the
_only_ one, that there isn't _any_ exception, is a non-trivial claim.
(For instance, that's exactly the claim to fame of the Ω(n log n) lower
bound on comparison-based sorting.)

> > I agree that that subset's pristines are necessarily able to be stored
> > locally at least from time to time, but no more than that.  It's not
> > _necessarily_ possible to store those files' pristines permanently [...]
> 
> You rightly point out that cases may exist where the pristines-wanted
> subset is only needed some of the time, and the rest of the time it's
> important to recover that space for other uses. That implies the
> pristines-wanted subset is "huge" -- otherwise by definition the space
> they occupy would not be unacceptable to store permanently.
> 
> When you need those pristines, it would therefore be OK to disable
> pristines-on-demand for the whole WC, because that isn't hugely worse
> than if you could choose just the subset. (Saving a minority of
> the pristines space is not a driving requirement for this feature, even
> if it would be nice to have.)

Haven't you just moved your goalposts?  I quote:

> > >    - the WC data set is "huge" […] in total; and
> > >    - there is a subset of files […]
> > >    - that subset of files is not "huge" in total; and

The subset of the files was "not 'huge' in total" upthread and is
responsible for "a minority of the pristine space" here.  Which is it?
We can't agree on handling this use-case until we agree on what this
use-case is.

> In those cases, switching the WC between pristines-present and
> pristines-on-demand would be necessary. Such "switching" is probably a
> strong requirement anyway, even outside this case, as I should think it
> would be considered poor UX if it were not possible to change one's mind
> without a re-checkout.

Even if it's poor UX, we should still ask whether this poor UX would or
wouldn't be a good exchange for the engineering effort of implementing
toggleability.  That's comparable to how having «svn upgrade» and
«svnadmin upgrade» at all means there could be bugs that affect upgraded
wc's/repositories but not new ones.  (For extra fun, the bug could be
latent and only surface after a further in-place upgrade.  Debian has
had such bugs, e.g., <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=620958#30>.)

> > Let me try to sketch a use-case for wanting only _some_ files to be
> > pristineless. [...]
> 
> I don't dispute that some cases exist where it would be nice to have
> per-file control. I still see it as merely "nice" and still do not see
> how it could be considered essential or very important.
> 

Hang on.  Why do you assume that if someone has big files, then they're
necessarily all in one directory and all the accompanying texty
(or otherwise diffable) files are all in another directory?  Sure,
that's exactly kfogel's use-case (described upthread), but it's not the
only way to structure a repository.

For instance, take our own /repos/asf/subversion/site/publish/download
and /repos/dist/release/subversion.  Those are separate repositories,
but Subversion (the software) does not dictate that.  If Infra had
decided to do things differently, to put the artifacts in the /site
directory, then only dev@ subscribers who participate in "… up for
signing/testing" threads would have had a reason to download the full
/site tree; everyone else (say, translators) would have needed only the
texty bits.  [That's actually a use-case for server-provided viewspecs;
and «svn checkout --depth=infinity» would override them…]

> It's not completely clear to me what you mean to draw out in your
> 'libsvn*.so' example. It seems to be a case where the user wants
> efficient 'commit' of a few files which are large enough to care about
> that operation (let's assume they are diffable enough for their
> pristines to be useful) -- but make up only a small subset of the total
> WC size so omitting pristines of the majority of the WC, which is huge,
> would be important to save space. Yes, that's a case where subset
> control would be nice.
> 
> But I would argue that, for that case, there are alternative and even better
> solutions than managing pristines. The user could make the WC shallow
> instead, omitting the pristines *and* working files of release branches
> they don't currently need to work on while behind the narrow downlink.

My use-case involved a user who wished to have 1.13's binaries available
to them offline (so they could reproduce and prepare fixes on the road).
Your proposed workflow does not support the assumed user's workflow, so
I don't see how it is a "better solution".

When all is said and done, if you want to make a fuel-efficient ambulance,
you look into extracting more joules of mechanical work out of each
litre of fuel.  You don't just park the ambulance for a day a week.

> Or they could have their main WC pristine-less and check out a separate
> WC, with pristines, containing just the minority parts that they need offline.

This argument is also an argument for closing issue #525 as WONTFIX.
"Use «export» rather than «checkout» and keep a parallel depth=empty
working copy wherein you'll pull (using «svn update --parents») just the
files you'll need."  Want «svn status»?  Use lndir(1) and «find ./
-type f».  Want «svn diff»?  Use «zfs snapshot» and diff(1).  Want «svn
update»?  Just «export» again since we assume the network connection is
wide and cheap and the files are undeltifiable.

For the last one, if we don't assume a wide and cheap downlink, we can
think of adding an «svn update --rather-than-download-pristines,-copy-
the-following-file-into-the-pristine-store-if-its-sha1-matches-a-sha1-of-
a-file-the-server-says-it\'s-about-to-send-us=/some/local/path» option,
which would do what its name says.  It'd be similar to «rsync --copy-dest».

And if we do make that assumption, one could probably implement a FUSE
filesystem that fetches pristine files on-demand.  There's exactly such
a solution linked from #525 ("scord"), but it was never updated for
wc-ng (svn ≥1.7).

> > Which brings me to a less contrived / more general point: What if the
> > user _knows in advance_ they'll need a pristine?  Shouldn't there be: —
> >  
> > - a way to say "I'm about to change a large, diffable file; detranslate
> >  it into the pristine store before I touch it"?  Perhaps even make
> >  files read-only at the OS level (as with svn:needs-lock) [...]?
> 
> > - a way to say "[...] download a pristine for this file now"?
> 
> > - «svn commit --keep-pristines» [...]?
> 
> At one level these are some logical extensions to the control that users
> would have over the pristine-management process. These additional
> controls might be valuable in certain cases.
> 
> In the context of the main driving use cases (fast connectivity to the
> repo) these would be marginal tweaks with no real benefit. They could
> have real benefits in the scenarios that we looked at above where there
> is neither plenty space nor plenty connectivity, and when per-file
> control of pristines is available.
> 
> We should consider making sure the API exposes these operations to
> keep/fetch/store pristines so that they could potentially be added to
> the UI of clients later. The 'svn' client would not necessarily ever
> want to expose this degree of control: it's likely too much to add to
> the user's cognitive load. It seems more something that certain scripts
> and clients built for automation tasks might benefit from, so might make
> sense just in APIs and bindings.

Agree that we should keep these in mind when designing the API.  As to
whether these belong in svn(1), in the API, in tools/, or in third-party
tools, we can cross that bridge when we come to it.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Replying to selected points from the last few messages.

Daniel Shahaf wrote:
> Julian Foad wrote on Thu, Jan 20, 2022 at 21:03:02 +0000:
>> The only case in which a simple per-WC setting might be unsatisfactory
>> is the following combination:
>  
> Why would it be the only case?

I assert that per-WC control suffices if any of the conditions I listed
is false.

> [...]
>>     - there is a subset of files on which the user needs to work
>> (requiring diffs, etc.) often enough that fetching their pristines "on
>> demand" is a problem; and
>  
> Disagree.  Why would fetching on-demand being a problem _necessarily_ be
> caused by an "often enough" need to work on some files?  Why couldn't
> on-demand fetching of pristines be a problem for files that change once in
> a blue moon?

We agree. If even once is a problem in a particular case, then once
qualifies as "often enough" in that case. Maybe my wording or my lower
limit wasn't clear.

> For example, [...] some files that are large and diffable and
> may need to be edited and diffed while behind a narrow downlink.

Yes, this is an example of the case I am describing where a simple
per-WC setting might be unsatisfactory.

>>     - that subset of files is not "huge" in total; and
>  
> I agree that that subset's pristines are necessarily able to be stored
> locally at least from time to time, but no more than that.  It's not
> _necessarily_ possible to store those files' pristines permanently [...]

You rightly point out that cases may exist where the pristines-wanted
subset is only needed some of the time, and the rest of the time it's
important to recover that space for other uses. That implies the
pristines-wanted subset is "huge" -- otherwise, by definition, the space
they occupy would be acceptable to store permanently.

When you need those pristines, it would therefore be OK to disable
pristines-on-demand for the whole WC, because that isn't hugely worse
than if you could choose just the subset. (Saving a minority of
the pristines space is not a driving requirement for this feature, even
if it would be nice to have.)

In those cases, switching the WC between pristines-present and
pristines-on-demand would be necessary. Such "switching" is probably a
strong requirement anyway, even outside this case, as I should think it
would be considered poor UX if it were not possible to change one's mind
without a re-checkout.

>>     - that subset of files can be distinguished from the rest by metadata.
>  
> Why is this necessarily the case [...] this seems to rule
> out solutions that involve hardcoded lists (à la svn:ignore [...]

I meant any sort of metadata including such lists, but basically you're
right that this is not really relevant to describing the use case.

> Let me try to sketch a use-case for wanting only _some_ files to be
> pristineless. [...]

I don't dispute that some cases exist where it would be nice to have
per-file control. I still see it as merely "nice" and still do not see
how it could be considered essential or very important.

It's not completely clear to me what you mean to draw out in your
'libsvn*.so' example. It seems to be a case where the user wants
efficient 'commit' of a few files which are large enough to care about
that operation (let's assume they are diffable enough for their
pristines to be useful) -- but make up only a small subset of the total
WC size so omitting pristines of the majority of the WC, which is huge,
would be important to save space. Yes, that's a case where subset
control would be nice.

But I would argue that, for that case, there are alternative and even
better solutions than managing pristines. The user could make the WC
shallow instead, omitting the pristines *and* working files of release
branches they don't currently need to work on while behind the narrow
downlink.
Or they could have their main WC pristine-less and check out a separate
WC, with pristines, containing just the minority parts that they need offline.

> Which brings me to a less contrived / more general point: What if the
> user _knows in advance_ they'll need a pristine?  Shouldn't there be: —
>  
> - a way to say "I'm about to change a large, diffable file; detranslate
>  it into the pristine store before I touch it"?  Perhaps even make
>  files read-only at the OS level (as with svn:needs-lock) [...]?

> - a way to say "[...] download a pristine for this file now"?

> - «svn commit --keep-pristines» [...]?

At one level these are some logical extensions to the control that users
would have over the pristine-management process. These additional
controls might be valuable in certain cases.

In the context of the main driving use cases (fast connectivity to the
repo) these would be marginal tweaks with no real benefit. They could
have real benefits in the scenarios that we looked at above where there
is neither plenty of space nor plenty of connectivity, and when per-file
control of pristines is available.

We should consider making sure the API exposes these operations to
keep/fetch/store pristines so that they could potentially be added to
the UI of clients later. The 'svn' client would not necessarily ever
want to expose this degree of control: it's likely too much to add to
the user's cognitive load. It seems more something that certain scripts
and clients built for automation tasks might benefit from, so might make
sense just in APIs and bindings.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Wed, Jan 26, 2022 at 14:53:24 +0000:
> Daniel Shahaf wrote:
> > Julian Foad wrote on Fri, Jan 21, 2022 at 11:15:04 +0000:
> >> Premature Hydrating
> >>  
> >> The present implementation "hydrates" (fetches missing pristines) every
> >> file within the whole subtree the operation targets. [...]
> >  
> > Does it?  Looking at textbase_walk_cb(), it only sets REFERENCED to TRUE
> > for modified files.  [...]
> 
> I meant it fetches missing pristines that are deemed *wanted*, for all
> files within the tree. That limits it to modified files only, but all
> modified files (that don't yet have their pristines) not just those that
> will be touched by the operation.

Ah, it hydrates before depth and changelist filtering?

> > However, in cases such as «svn diff --diff-cmd=<GUI tool>», fetching the
> > pristines (too) close to the time they are needed could result in having
> > to reopen RA sessions.
> 
> What would be a problem with that?

More user-visible delays, need to authenticate again, …

> How is it different from existing long-running diff scenarios?

What scenarios do you have in mind? «svn diff --diff-cmd=<GUI tool>
URL1 URL2»?

In any case, the question isn't "Are we introducing a problem that
already exists [avoidably or otherwise] in other use-cases" but "Are we
introducing a problem that we can avoid introducing".

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Daniel Shahaf wrote:
> Julian Foad wrote on Fri, Jan 21, 2022 at 11:15:04 +0000:
>> I'm concerned about the implementation scanning the whole subtree,
>> calling 'stat' on every file [...]
>> Don't we already have an optimised scan for local modifications
>> implemented in the "status" code?
>  
> This? — [subversion/libsvn_wc/status.c]
> [...]
>  
> This still does a stat() on every file; how else would it obtain
> dirent->mtime?  It doesn't do open()/read()/memcmp().

The main point is DRY: maintain exactly one implementation so that both
the precise functionality and the performance optimisations are shared.

An example of the sort of thing the optimised code *could* potentially
do would be to obtain the mtimes of all the files in a directory in one
single read-dir call, depending on APR/OS/FS details.

TODO: deduplicate this walker.

>> Premature Hydrating
>>  
>> The present implementation "hydrates" (fetches missing pristines) every
>> file within the whole subtree the operation targets. [...]
>  
> Does it?  Looking at textbase_walk_cb(), it only sets REFERENCED to TRUE
> for modified files.  [...]

I meant it fetches missing pristines that are deemed *wanted*, for all
files within the tree. That limits it to modified files only, but all
modified files (that don't yet have their pristines) not just those that
will be touched by the operation.

> However, in cases such as «svn diff --diff-cmd=<GUI tool>», fetching the
> pristines (too) close to the time they are needed could result in having
> to reopen RA sessions.

What would be a problem with that? How is it different from existing
long-running diff scenarios?

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Fri, Jan 21, 2022 at 11:15:04 +0000:
> Scanning with 'stat'
> 
> I'm concerned about the implementation scanning the whole subtree,
> calling 'stat' on every file to determine whether the file is "changed"
> (locally modified). This is done in svn_wc__textbase_sync() with its textbase_walk_cb().
> 
> It does this scan on every sync, which is twice on every syncing
> operation such as diff.
> 
> Don't we already have an optimised scan for local modifications
> implemented in the "status" code?

This? —

   [subversion/libsvn_wc/status.c]
          /* If the on-disk dirent exactly matches the expected state
             skip all operations in svn_wc__internal_text_modified_p()
             to avoid an extra filestat for every file, which can be
             expensive on network drives as a filestat usually can't
             be cached there */
          if (!info->has_checksum)
            text_modified_p = TRUE; /* Local addition -> Modified */
          else if (ignore_text_mods
                  ||(dirent
                     && info->recorded_size != SVN_INVALID_FILESIZE
                     && info->recorded_time != 0
                     && info->recorded_size == dirent->filesize
                     && info->recorded_time == dirent->mtime))
            text_modified_p = FALSE;

This still does a stat() on every file; how else would it obtain
dirent->mtime?  It doesn't do open()/read()/memcmp().

> Could we re-use this?

textbase_walk_cb() calls check_file_modified() which has a very similar
size-and-mtime check right at the top.  So the same logic is already
applied in both places; it is just implemented twice.

Reuse would be nice, of course.  If nothing else, we could at least add
comments to the two locations cross-referencing them.
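
To make the size-and-mtime shortcut concrete, here is a minimal sketch in C using POSIX stat(). It is not the code from status.c or check_file_modified(). The struct and function names are hypothetical, and the recorded timestamp is simplified to whole seconds, whereas Subversion uses its own types.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical stand-in for the size and mtime recorded in the WC
   database; Subversion's real fields use its own types. */
struct recorded_info {
  int64_t size;   /* negative means "not recorded" */
  int64_t mtime;  /* seconds since the epoch; 0 means "not recorded" */
};

/* Pure comparison: true when the on-disk size and mtime both match
   valid recorded values, so the file can be presumed unmodified. */
bool matches_recorded(int64_t disk_size, int64_t disk_mtime,
                      const struct recorded_info *rec)
{
  return rec->size >= 0
      && rec->mtime != 0
      && rec->size == disk_size
      && rec->mtime == disk_mtime;
}

/* One stat() per file, with no open()/read()/memcmp().  Returns true
   when the file must be treated as possibly modified. */
bool possibly_modified(const char *path, const struct recorded_info *rec)
{
  struct stat st;

  if (stat(path, &st) != 0)
    return true;  /* cannot stat: do not claim it is unmodified */
  return !matches_recorded((int64_t)st.st_size, (int64_t)st.st_mtime, rec);
}
```

Keeping the comparison in one pure function is one way both callers could share the check; a "possibly modified" result would still need a content comparison to be confirmed.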

> Premature Hydrating
> 
> The present implementation "hydrates" (fetches missing pristines) every
> file within the whole subtree the operation targets. This is done by
> every major client operation calling svn_client__textbase_sync() before
> and afterwards.
> 

Does it?  Looking at textbase_walk_cb(), it only sets REFERENCED to TRUE
for modified files.  If I understand correctly textbase_walk_cb() and
the docstring of svn_wc__db_textbase_walk(), something along the lines of
.
    svn revert -R ./
    echo foo > subversion/tests/README
    svn diff
.
would fetch the pristine only for that one file, wouldn't it?

Sorry, I haven't got time to test this right now.

> That is pessimistic: the operation may not actually touch all these
> files if limited in any way such as by
> 
>   - depth filtering
>   - other filtering (changelist, properties-only, ...)
>   - terminating early (e.g. output piped to 'head')
> 
> That introduces all the fetching overhead for the given subtree as a
> latency before the operation shows its results, which for something
> small at the root of the tree such as "svn diff --depth=empty
> --properties-only ./" may make a significant usability impact.
> 
> Presumably we could add the depth and some other kinds of filtering to
> the tree walk. But that will always leave terminating early, and
> possibly other cases, sub-optimal.
> 
> I would prefer a solution that defers the hydrating until closer to the
> moment of demand.

Agree that from a UX perspective, it would be nice to avoid a long delay
at the start of an operation.

However, in cases such as «svn diff --diff-cmd=<GUI tool>», fetching the
pristines (too) close to the time they are needed could result in having
to reopen RA sessions.  In this case, perhaps it would make sense to
download the pristines in the background in a separate thread (at least
in case APR_HAS_THREADS)?

> Evgeny, have you looked into these possibilities at all? What are your
> thoughts about these?

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

I have been studying when this implementation fetches pristines. Two
concerns about performance in the current implementation:

1. scanning the whole subtree, calling 'stat' on every file

2. premature hydrating


Scanning with 'stat'

I'm concerned about the implementation scanning the whole subtree,
calling 'stat' on every file to determine whether the file is "changed"
(locally modified). This is done in svn_wc__textbase_sync() with its textbase_walk_cb().

It does this scan on every sync, which is twice on every syncing
operation such as diff.

Don't we already have an optimised scan for local modifications
implemented in the "status" code? Could we re-use this?


Premature Hydrating

The present implementation "hydrates" (fetches missing pristines) every
file within the whole subtree the operation targets. This is done by
every major client operation calling svn_client__textbase_sync() before
and afterwards.

That is pessimistic: the operation may not actually touch all these
files if limited in any way such as by

  - depth filtering
  - other filtering (changelist, properties-only, ...)
  - terminating early (e.g. output piped to 'head')

That introduces all the fetching overhead for the given subtree as a
latency before the operation shows its results, which for something
small at the root of the tree such as "svn diff --depth=empty
--properties-only ./" may make a significant usability impact.

Presumably we could add the depth and some other kinds of filtering to
the tree walk. But that will always leave terminating early, and
possibly other cases, sub-optimal.

I would prefer a solution that defers the hydrating until closer to the
moment of demand.
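
A rough sketch of what "deferring until the moment of demand" could look like, in C. This is only an illustration of the idea, not the branch's design. All names (wc_file, fetch_fn, ensure_pristine, visit_files, counting_fetch) are hypothetical, and the fetch callback stands in for whatever RA call would really be made.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback that fetches one pristine; returns 0 on success. */
typedef int (*fetch_fn)(const char *path, void *baton);

struct wc_file {
  const char *path;
  bool pristine_present;
};

/* Fetch the pristine for one file only if it is missing. */
int ensure_pristine(struct wc_file *f, fetch_fn fetch, void *baton)
{
  int err;

  if (f->pristine_present)
    return 0;
  err = fetch(f->path, baton);
  if (err == 0)
    f->pristine_present = true;
  return err;
}

/* Visit at most LIMIT files, hydrating each one just before use.
   LIMIT models early termination (e.g. output piped to 'head').
   Returns the number of files visited, or -1 if a fetch failed. */
int visit_files(struct wc_file *files, size_t count, size_t limit,
                fetch_fn fetch, void *baton)
{
  size_t i;

  for (i = 0; i < count && i < limit; i++)
    if (ensure_pristine(&files[i], fetch, baton) != 0)
      return -1;
  return (int)i;
}

/* Example callback: counts fetches instead of contacting a server. */
int counting_fetch(const char *path, void *baton)
{
  (void)path;
  ++*(int *)baton;
  return 0;
}
```

With this shape, files that the operation never reaches are never fetched, at the cost of fetching one file at a time rather than in a single batch up front.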


Evgeny, have you looked into these possibilities at all? What are your
thoughts about these?

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 21 Jan 2022, Julian Foad wrote:
>Only for commands that need them, but, as mentioned above,
>pessimistically for every file that the command *possibly* 
>pertains to.
>I'll follow up on that.

*nod*  Will look for that, thanks.

>It will not fetch for 'commit' once I commit Evgeny's tiny patch 
>to make
>it so.

Great.

>>> The only case in which [that] might be unsatisfactory is [...]
>>>    - there is a subset of files on which the user needs to 
>>>    work [...]
>>> often enough that fetching their pristines "on demand" is a 
>>> problem;
>>> [...]
>>> It is not one of the cases driving this [...]
>> 
>> Well, that case is almost exactly our use case at my company 
>> :-),
>> except that I think fetching pristines on demand will be fine.
>
>Well, that's the crucial difference: in your case, fetching some
>pristines on demand sometimes is not a problem, so this solution 
>works.

Yep, agreed!

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Replying to the last three posts (Nathan, Karl, Johan).

Nathan wrote:
> The pristine for a given file is not fetched until the file is
> modified and an
> operation pertaining to it requires the pristine.

That's basically how the 'pristines-on-demand' branch is working.
However, the present implementation fetches ('hydrates') pristines for
every file within the whole subtree the operation targets. That is
pessimistic: the operation may not actually touch all these files if
limited in any way such as by depth filtering. I'll follow up with a
separate report on that.

> Once fetched, the pristine remains present until the file becomes unmodified
> through either 'commit' or 'revert'.

More precisely, the pristine remains until the file is once again
detected to be unmodified, which could happen by any means including
'commit' or 'revert' or being returned to its previous state by the user
outside of Subversion's control.
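
The retention rule described above can be modelled as a small decision function. This is a simplified sketch in C of the policy as stated in this thread, for an operation that needs pristines. It is not the branch's actual code, and the enum and function names are hypothetical.

```c
#include <stdbool.h>

enum pristine_action {
  PRISTINE_NO_CHANGE,  /* nothing to do */
  PRISTINE_FETCH,      /* file is modified but its pristine is missing */
  PRISTINE_DISCARD     /* file is unmodified, pristine no longer wanted */
};

/* Decide what a sync step would do for one file, based only on
   whether it is detected as locally modified and whether its
   pristine is currently stored. */
enum pristine_action sync_action(bool locally_modified, bool pristine_present)
{
  if (locally_modified && !pristine_present)
    return PRISTINE_FETCH;
  if (!locally_modified && pristine_present)
    return PRISTINE_DISCARD;
  return PRISTINE_NO_CHANGE;
}
```

Note that the function takes no "how did it become unmodified" input: commit, revert, and a manual edit back to the original contents all lead to the same PRISTINE_DISCARD result.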

> [...] it occurred to me that if the file is deleted, the
> pristine (if present) should be deleted as well. (Potential caveat: if
> the file
> is modified, subsequently marked for deletion with '--keep-local', and
> subsequently the user runs 'svn revert', the expected result is to
> restore the original contents as they appear in BASE.)

The expected result for reverting any file marked as "deleted", no
matter whether it is present on disk, modified or not, is to return it
to pristine state.

I suppose your instinct is to not "waste space" once the user has
decided the file is no longer wanted, especially if it's a huge file and
the pristine isn't going to be useful for diffs etc.

That all makes sense once we commit the delete, but until then it's
considered a temporary working state that we might want to revert.

Because this file is in a state of local modification (in the general
sense that includes delete), I think the pristine should be kept, in
line with the simple and general policy that we should keep pristines
around for all files in a locally modified state. I understand we could
treat the "deleted" state specially, but not sure we should.

From what I understand so far of the current state of implementation, a
delete operation will first "hydrate" the textbase storage by copying
the file from working storage if it's present and unmodified, before
deleting it; and otherwise will mark it as "pristine needed" to be
fetched later on demand.
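
As a sketch of that understanding (and no more than that), the preparation step before a delete could be expressed in C as below. The names are hypothetical, and the "pristine already present" branch is an added assumption rather than something stated in the thread.

```c
#include <stdbool.h>

enum delete_prep {
  DELETE_PREP_NOTHING,       /* assumption: pristine is already stored */
  DELETE_PREP_COPY_WORKING,  /* copy the unmodified working file into the textbase */
  DELETE_PREP_MARK_NEEDED    /* mark "pristine needed", fetch later on demand */
};

/* Choose how to preserve the pristine before the working file is deleted. */
enum delete_prep plan_delete(bool pristine_present,
                             bool working_present,
                             bool working_modified)
{
  if (pristine_present)
    return DELETE_PREP_NOTHING;
  if (working_present && !working_modified)
    return DELETE_PREP_COPY_WORKING;
  return DELETE_PREP_MARK_NEEDED;
}
```

The local copy avoids any network traffic in the common case where the file being deleted is unmodified.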

Karl wrote:
> And it only fetches pristine for commands that absolutely need
> pristine, I assume?  (I think you said earlier that it does not
> fetch for 'commit'.)

Only for commands that need them, but, as mentioned above,
pessimistically for every file that the command *possibly* pertains to.
I'll follow up on that.

It will not fetch for 'commit' once I commit Evgeny's tiny patch to make
it so.

>> The only case in which [that] might be unsatisfactory is [...]
>>    - there is a subset of files on which the user needs to work [...]
>> often enough that fetching their pristines "on demand" is a problem;
>> [...]
>> It is not one of the cases driving this [...]
> 
> Well, that case is almost exactly our use case at my company :-),
> except that I think fetching pristines on demand will be fine.

Well, that's the crucial difference: in your case, fetching some
pristines on demand sometimes is not a problem, so this solution works.

Johan wrote:
> Just as another data point, at my company we have a slightly different
> use case where this feature would be great (I think), and where a
> simple per-WC-yes/no switch would work fine. In this case, we'd
> probably also want a "system-wide runtime config area default".
> 
> The use case: [...] "I don't need 99.9% of
> those pristines most of the time, and it's blazingly fast to get them
> when needed".
> 
> Ideally, after pristines-on-demand become possible, I'd do the following:
> - Set a system-wide flag to make pristines-on-demand the default for
> new WC's.

Makes sense.

> - Run a script to "convert" all existing working copies to
> pristines-on-demand (setting the per-WC flag).

Agreed, we need to ensure there is a way to do this.

> - Run a script to "vacuum-pristines" those converted working copies.

Agreed, if the previous step doesn't do it.
(In the current implementation, 'svn upgrade' does this.)

> - Receive chocolates from our sysadmins (probably not, but I can try).

Hehe.

> (BTW: a lot of those working copies are similar, so a feature to have
> a "shared pristine store" would also help [...])

Agreed, that's an alternative with different pros and cons, that would
have approximately the same overall benefit in your case.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 21 Jan 2022, Johan Corveleyn wrote:
>I like where this is going. Thanks to all involved for pushing it 
>forward :-).

You're welcome!  Thanks for the use-case example, too.

Any chance your company would want to join the consortium that is 
funding this work?  Please follow up with me privately if so (or 
introduce me to the right person if needed).

We have a few companies in the consortium already, but I'm always 
keeping an eye on budget and scope -- the more we pool our 
resources, the less it costs for each member and the more 
risk-resilient this overall effort becomes.

(I have a great deal of faith in the implementors, who are skilled 
and who know Subversion well.  But anyone who's managed projects 
will do whatever they reasonably can to reduce risk! :-) )

Best regards,
-Karl

>Just as another data point, at my company we have a slightly 
>different
>use case where this feature would be great (I think), and where a
>simple per-WC-yes/no switch would work fine. In this case, we'd
>probably also want a "system-wide runtime config area default".
>
>The use case: we have a couple of build machines for "official 
>full
>builds" (for test, staging, production, ... different stages) of 
>our
>major applications. To save time, for most big applications, we 
>keep
>re-using the large checked-out working copies (update, build, and
>switch to the next release branch if there is a next release). We 
>also
>use those same working copies for backporting cherrypick merges 
>to our
>release branches (it's all part of our release process, supported 
>by
>build scripts and procedures to do these things).
>
>So, currently, our largest build server has 400 such working 
>copies on
>disk, and they remain there "dormant" until someone updates, 
>builds,
>cherrypick-merges, commits-a-tag-from-WC, ... Totalling around 
>450 GB
>at the moment. Not an absolute killer, but this is a growing 
>number,
>and our sysadmins would be quite happy to see that number be +/-
>divided by 2 :-). Besides, these build servers are located in our 
>data
>centers, "right next to" the SVN server, i.e. having a 1 Gb or 10 
>Gb
>connection to the repository. A perfect fit for "I don't need 
>99.9% of
>those pristines most of the time, and it's blazingly fast to get 
>them
>when needed".
>
>Ideally, after pristines-on-demand become possible, I'd do the 
>following:
>- Set a system-wide flag to make pristines-on-demand the default 
>for new WC's.
>- Run a script to "convert" all existing working copies to
>pristines-on-demand (setting the per-WC flag).
>- Run a script to "vacuum-pristines" those converted working 
>copies.
>- Receive chocolates from our sysadmins (probably not, but I can 
>try).
>
>(BTW: a lot of those working copies are similar, so a feature to 
>have
>a "shared pristine store" would also help, but that's another 
>feature
>altogether, and perhaps much more difficult to get right -- both
>features would solve the wasted-disk-storage problem here, so I'm
>happy either way)
>
>Kind regards,

Re: Sharing .svn directories across wc's (was: Re: A two-part vision for Subversion and large binary objects.)

Posted by Branko Čibej <br...@apache.org>.

On 24.01.2022 14:39, Daniel Shahaf wrote:
> Daniel Sahlberg wrote on Mon, Jan 24, 2022 at 14:13:26 +0100:
>> Den mån 24 jan. 2022 kl 14:02 skrev Daniel Shahaf <d....@daniel.shahaf.name>:
>>
>>> As to what it'll take to actually implement this… I'm not sure.  If
>>> someone went in and changed «mkdir(".svn")» to «symlink("/well/known/path",
>>> ".svn")», would things Just Work™, or?
>>>
>> There are OSes where the support for symlinks is, let's say, less than
>> perfect.
> And?

WC_ID is hardcoded to 1 pretty much everywhere. There'd be a bit of work 
to make WC identification explicit.

Also the meaning of 'svn upgrade' in this context becomes ... interesting.


> I didn't propose symlinks as a solution.  I only asked about them in
> order to identify the blockers to implementing shared .svn dirs: whether
> it's as simple as needing to invent a «--dotsvn-dir» option, or is there
> more work needed on, say, identifying SQLite queries that compare
> LOCAL_RELPATH without also comparing WC_ID.


That would be most of the queries.


>
>> See for example issue SVN-3570. Of course, when we solve #3570,
>> then the shared wc.db would be an easy fix ;-)
> Or we could deal with the symlink on Windows the same way we deal with
> versioned symlinks on Windows: by creating a file ".svn" with content
> "link /path/to/some/where".


Or we could try not to invent platform-specific solutions for something 
like this. *shudder*

-- Brane

Re: Sharing .svn directories across wc's (was: Re: A two-part vision for Subversion and large binary objects.)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Daniel Sahlberg wrote on Mon, Jan 24, 2022 at 14:13:26 +0100:
> Den mån 24 jan. 2022 kl 14:02 skrev Daniel Shahaf <d....@daniel.shahaf.name>:
> 
> > As to what it'll take to actually implement this… I'm not sure.  If
> > someone went in and changed «mkdir(".svn")» to «symlink("/well/known/path",
> > ".svn")», would things Just Work™, or?
> >
> 
> There are OSes where the support for symlinks is, let's say, less than
> perfect.

And?

I didn't propose symlinks as a solution.  I only asked about them in
order to identify the blockers to implementing shared .svn dirs: whether
it's as simple as needing to invent a «--dotsvn-dir» option, or is there
more work needed on, say, identifying SQLite queries that compare
LOCAL_RELPATH without also comparing WC_ID.
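
To illustrate why such queries would matter, here is a small in-memory sketch in C. It uses a hypothetical array of rows in place of the NODES table and plain loops in place of SQL; only the wc_id and local_relpath column names come from the discussion. Once two working copies share one database, a lookup by path alone becomes ambiguous.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for rows of the NODES table. */
struct node_row {
  int wc_id;
  const char *local_relpath;
};

/* Lookup that compares only the path, as a query ignoring wc_id would. */
size_t count_by_relpath(const struct node_row *rows, size_t n,
                        const char *relpath)
{
  size_t i, hits = 0;

  for (i = 0; i < n; i++)
    if (strcmp(rows[i].local_relpath, relpath) == 0)
      hits++;
  return hits;
}

/* Lookup that compares both the working copy id and the path. */
size_t count_by_wc_and_relpath(const struct node_row *rows, size_t n,
                               int wc_id, const char *relpath)
{
  size_t i, hits = 0;

  for (i = 0; i < n; i++)
    if (rows[i].wc_id == wc_id
        && strcmp(rows[i].local_relpath, relpath) == 0)
      hits++;
  return hits;
}
```

With a single working copy per database both lookups return the same rows, which is why a path-only comparison would go unnoticed today.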

> See for example issue SVN-3570. Of course, when we solve #3570,
> then the shared wc.db would be an easy fix ;-)

Or we could deal with the symlink on Windows the same way we deal with
versioned symlinks on Windows: by creating a file ".svn" with content
"link /path/to/some/where".

Cheers,

Daniel

Re: Sharing .svn directories across wc's (was: Re: A two-part vision for Subversion and large binary objects.)

Posted by Daniel Sahlberg <da...@gmail.com>.

Den mån 24 jan. 2022 kl 14:02 skrev Daniel Shahaf <d....@daniel.shahaf.name>:

> As to what it'll take to actually implement this… I'm not sure.  If
> someone went in and changed «mkdir(".svn")» to «symlink("/well/known/path",
> ".svn")», would things Just Work™, or?
>

There are OSes where the support for symlinks is, let's say, less than
perfect. See for example issue SVN-3570. Of course, when we solve #3570,
then the shared wc.db would be an easy fix ;-)

Kind regards,
/Daniel

Sharing .svn directories across wc's (was: Re: A two-part vision for Subversion and large binary objects.)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Johan Corveleyn wrote on Fri, Jan 21, 2022 at 10:12:57 +0100:
> (BTW: a lot of those working copies are similar, so a feature to have
> a "shared pristine store" would also help, but that's another feature
> altogether, and perhaps much more difficult to get right -- both
> features would solve the wasted-disk-storage problem here, so I'm
> happy either way)

The ≥1.7-era libsvn_wc ("wc-ng") was specifically designed to support
having a single .svn/wc.db serve multiple wc's.  That's why we have the
«wc_id» column in the NODES table.

As to what it'll take to actually implement this… I'm not sure.  If
someone went in and changed «mkdir(".svn")» to «symlink("/well/known/path",
".svn")», would things Just Work™, or?

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Johan Corveleyn <jc...@gmail.com>.

On Fri, Jan 21, 2022 at 5:56 AM Karl Fogel <kf...@red-bean.com> wrote:
>
> On 20 Jan 2022, Julian Foad wrote:
> >The more I think about this, the more I think we are prematurely
> >complicating the requirements in this respect. I'm going to
> >back-track
> >and posit that a simple per-WC switch should suffice for the vast
> >majority of cases, and has the benefit of simplicity. (The user
> >might
> >wish to set this based on the repository location -- local/fast
> >versus remote/slow.)
>
> Personally I'd be very happy to start with this.  We can always
> improve the client-side UI for the feature more in the future.
>
> >I will note that I previously misunderstood the current
> >'pristines-on-demand' implementation as fetching the pristine
> >before a
> >diff (for example) and discarding it afterwards.  In fact it
> >keeps the
> >pristine as long as the file in question remains in a locally
> >modified
> >state, and only discards the pristine when (before or after some
> >client
> >operation) the file is no longer in a modified state. That is to
> >say, it
> >fetches pristines less often than I had thought.
>
> And it only fetches pristine for commands that absolutely need
> pristine, I assume?  (I think you said earlier that it does not
> fetch for 'commit'.)
>
> I like the trick of keeping the pristine, once fetched, for as
> long as the file is locally modified.
>
> >The only case in which a simple per-WC setting might be
> >unsatisfactory
> >is the following combination:
> >
> >  - the repository is "slow" (and/or offline working is
> >  required);
> >
> >  - and, in a single WC:
> >    - the WC data set is "huge" (relative to local disk space) in
> >    total; and
> >    - there is a subset of files on which the user needs to work
> >(requiring diffs, etc.) often enough that fetching their
> >pristines "on
> >demand" is a problem; and
> >    - that subset of files is not "huge" in total; and
> >    - that subset of files can be distinguished from the rest by
> >    metadata.
> >
> >That is certainly a possible case, but we have no suggestion that
> >it is
> >at all common. It is not one of the cases driving this
> >feature. So I
> >think it is not something to design for at this stage.
>
> Well, that case is almost exactly our use case at my company :-),
> except that I think fetching pristines on demand will be fine.
> Thus, we can live with a per-WC setting.
>
> >I'm going to work on getting something more basic (per-WC yes/no)
> >closer
> >to production-ready and then we can re-assess it.
>
> Sounds good!
>
> Best regards,
> -Karl

I like where this is going. Thanks to all involved for pushing it forward :-).

Just as another data point, at my company we have a slightly different
use case where this feature would be great (I think), and where a
simple per-WC-yes/no switch would work fine. In this case, we'd
probably also want a "system-wide runtime config area default".

The use case: we have a couple of build machines for "official full
builds" (for test, staging, production, ... different stages) of our
major applications. To save time, for most big applications, we keep
re-using the large checked-out working copies (update, build, and
switch to the next release branch if there is a next release). We also
use those same working copies for backporting cherrypick merges to our
release branches (it's all part of our release process, supported by
build scripts and procedures to do these things).

So, currently, our largest build server has 400 such working copies on
disk, and they remain there "dormant" until someone updates, builds,
cherrypick-merges, commits-a-tag-from-WC, ... Totalling around 450 GB
at the moment. Not an absolute killer, but this is a growing number,
and our sysadmins would be quite happy to see that number be +/-
divided by 2 :-). Besides, these build servers are located in our data
centers, "right next to" the SVN server, i.e. having a 1 Gb or 10 Gb
connection to the repository. A perfect fit for "I don't need 99.9% of
those pristines most of the time, and it's blazingly fast to get them
when needed".

Ideally, after pristines-on-demand become possible, I'd do the following:
- Set a system-wide flag to make pristines-on-demand the default for new WC's.
- Run a script to "convert" all existing working copies to
pristines-on-demand (setting the per-WC flag).
- Run a script to "vacuum-pristines" those converted working copies.
- Receive chocolates from our sysadmins (probably not, but I can try).

(BTW: a lot of those working copies are similar, so a feature to have
a "shared pristine store" would also help, but that's another feature
altogether, and perhaps much more difficult to get right -- both
features would solve the wasted-disk-storage problem here, so I'm
happy either way)

Kind regards,
-- 
Johan

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Jan 2022, Julian Foad wrote:
>The more I think about this, the more I think we are prematurely
>complicating the requirements in this respect. I'm going to 
>back-track
>and posit that a simple per-WC switch should suffice for the vast
>majority of cases, and has the benefit of simplicity. (The user 
>might
>wish to set this based on the repository location -- local/fast 
>versus remote/slow.)

Personally I'd be very happy to start with this.  We can always 
improve the client-side UI for the feature more in the future.

>I will note that I previously misunderstood the current
>'pristines-on-demand' implementation as fetching the pristine before a
>diff (for example) and discarding it afterwards.  In fact it keeps the
>pristine as long as the file in question remains in a locally modified
>state, and only discards the pristine when (before or after some client
>operation) the file is no longer in a modified state. That is to say, it
>fetches pristines less often than I had thought.

And it only fetches pristine for commands that absolutely need 
pristine, I assume?  (I think you said earlier that it does not 
fetch for 'commit'.)

I like the trick of keeping the pristine, once fetched, for as 
long as the file is locally modified. 

>The only case in which a simple per-WC setting might be unsatisfactory
>is the following combination:
>
>  - the repository is "slow" (and/or offline working is required);
>
>  - and, in a single WC:
>    - the WC data set is "huge" (relative to local disk space) in total; and
>    - there is a subset of files on which the user needs to work
>      (requiring diffs, etc.) often enough that fetching their
>      pristines "on demand" is a problem; and
>    - that subset of files is not "huge" in total; and
>    - that subset of files can be distinguished from the rest by metadata.
>
>That is certainly a possible case, but we have no suggestion that it is
>at all common. It is not one of the cases driving this feature. So I
>think it is not something to design for at this stage.

Well, that case is almost exactly our use case at my company :-), 
except that I think fetching pristines on demand will be fine. 
Thus, we can live with a per-WC setting.

>I'm going to work on getting something more basic (per-WC yes/no) closer
>to production-ready and then we can re-assess it.

Sounds good!

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Mon, Jan 24, 2022 at 11:50 AM Karl Fogel <kf...@red-bean.com> wrote:

> On 24 Jan 2022, Daniel Shahaf wrote:
>
> >Which brings me to a less contrived / more general point: What if
> >the user _knows in advance_ they'll need a pristine?  Shouldn't
> >there be: —
> >
> >- a way to say "I'm about to change a large, diffable file;
> >  detranslate it into the pristine store before I touch it"?
> >  Perhaps even make files read-only at the OS level (as with
> >  svn:needs-lock) so the user doesn't modify the file accidentally
> >  until its pristine has been set aside?
>
> 'svn hydrate'?  (I can't even tell if I'm joking.)

Suppose that in the future we get a "history depth" feature. I vaguely
recall a discussion about that. It is possible we even have a feature
request filed. In any event the idea of history depth is like that of
directory tree depth, but for locally cached history. To illustrate, SVN
today always has a history depth of one, meaning that pristines serve as a
local cache of BASE. Pristines-on-demand can be seen as a history depth of
zero, meaning no local cache (until needed for specific files). A future
history depth feature may make it possible to locally cache the last n
revisions, or even infinity (full DVCS).

Perhaps the command to hydrate/dehydrate pristines should be designed with
this possibility in mind, i.e., not limit it only to a true/false value.

E.g.,

# "Normal" pristines
$ svn update --set-history-depth=immediates .

# Pristines-on-demand:
$ svn update --set-history-depth=none .

The limitation, for now, is that it may be applied only to the wc root,
since the underlying logic is currently a wc-wide on/off switch.
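
To make the idea concrete, here is a small Python sketch that treats
history depth as "how many revisions are cached locally". Everything in
it is hypothetical: no such option exists, and the names
(HISTORY_DEPTHS, parse_history_depth, keeps_base_pristine) are mine,
chosen only to show that a depth value generalizes a true/false switch.

```python
import math

# Hypothetical depth words, mirroring the directory-depth vocabulary.
HISTORY_DEPTHS = {
    "none": 0,              # pristines-on-demand: nothing cached up front
    "immediates": 1,        # today's behaviour: the BASE pristine is cached
    "infinity": math.inf,   # full history cached locally (DVCS-like)
}

def parse_history_depth(value):
    """Map a depth word or a non-negative integer string to a revision count."""
    if value in HISTORY_DEPTHS:
        return HISTORY_DEPTHS[value]
    n = int(value)  # raises ValueError for anything that is not an integer
    if n < 0:
        raise ValueError("history depth must not be negative")
    return n

def keeps_base_pristine(depth):
    """True if the BASE pristine is cached unconditionally at this depth."""
    return depth >= 1
```

With this shape, the current wc-wide on/off switch is just the special
case of depth 0 versus depth 1.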

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Tue, Jan 25, 2022 at 01:22:07 -0600:
> On 25 Jan 2022, Daniel Shahaf wrote:
> > We _could_ make them in a way that doesn't require us to provide
> > compatibility for them forever, such as by releasing them as
> > "experimental" (cf.
> > https://subversion.apache.org/docs/release-notes/1.10#shelving),
> > by releasing an alpha or a nightly and soliciting feedback for
> > that, or by prototyping in Python what can be so prototyped.
> 
> We could, but I also have the feeling that after a while (a few months?)  of
> usage of the basic implementation, we'll all have a pretty good idea of what
> improvements would be most helpful.

Sure.  I was just thinking that getting the functionality into a tarball
would mean more people would be able to test it.

> > (Aside: "explicitly-hydrated" is a bit of a mouthful.  I considered just
> > referring to these as "somatic" and "autonomous" pristines…)
> 
> Yes... that would be so... clarifying...  <ahem>  :-)

I know; that's why I didn't use those terms.  But they _would_ have been
very greppable :)

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 25 Jan 2022, Daniel Shahaf wrote:
>Karl Fogel wrote on Mon, Jan 24, 2022 at 12:35:10 -0600:
>> I'm partly just thinking out loud here, to stimulate us all to think.  None
>> of this affects the initial, whole-WC implementation, and of course let's
>> keep in mind that the *main* use case will be already well-served by that
>> initial implementation.  These further improvements are for the future, and
>> perhaps we shouldn't even make them until we've all had some experience with
>> the initial simple UI.
>
>+1 to every single sentence of this paragraph.

*whew*

>We _could_ make them in a way that doesn't require us to provide
>compatibility for them forever, such as by releasing them as
>"experimental" 
>(cf. https://subversion.apache.org/docs/release-notes/1.10#shelving),
>by releasing an alpha or a nightly and soliciting feedback for
>that, or by prototyping in Python what can be so prototyped.

We could, but I also have the feeling that after a while (a few 
months?)  of usage of the basic implementation, we'll all have a 
pretty good idea of what improvements would be most helpful.

>(Aside: "explicitly-hydrated" is a bit of a mouthful.  I considered just
>referring to these as "somatic" and "autonomous" pristines…)

Yes... that would be so... clarifying...  <ahem>  :-)

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, Jan 24, 2022 at 12:35:10 -0600:
> On 24 Jan 2022, Daniel Shahaf wrote:
> > > > - «svn commit --keep-pristines», in case Alice has two logical
> > > > changes that she'd like to make in separate commits?
> > > 
> > > > Maybe, or maybe one just uses 'svn dehydrate' ('svn hydrate
> > > > --dehydrate' :-) ) when one is done working on the file.
> > 
> > I figured «svn commit --keep-pristines» could be used not only after
> > manually hydrating but also after implicit hydrating (e.g., after
> > «echo foo >> iota && svn diff iota»).
> 
> Before implementing these options (which obviously won't happen in the first
> iteration anyway), we should think carefully about naming and about how much
> of the underlying implementation detail we want to expose.

+1; I thought so too, just didn't say so explicitly.

> And there are broader UI/UX questions:
> 
> Maybe once someone starts working on a (large) file, they are likely to
> continue working on it.  In which case, we should keep the pristine until
> told to drop it.  Then 'svn cleanup' could have an additional behavior:
> "remove pristine for any unmodified file that would *normally* not have a
> pristine (except that the user manually caused it to have a pristine due to
> some special action or circumstance)".
> 
> This is a bit different from the current '--vacuum-pristines' option of 'svn
> cleanup', by the way, though maybe there should be some connection.

Let's disentangle this.

In 1.14, «svn cleanup --vacuum-pristines» is used to garbage collect
(GC) unreferenced pristines.  The need to run it manually has been
documented as a bug since https://subversion.apache.org/docs/release-notes/1.7#wc-pristines
(10 years ago).

In pristines-on-demand (as it stands / as currently envisioned)
a pristine may be either absent altogether, present because the file
_had been_ locally modified ("implicitly hydrated"), or present because
the user advised us that the file is _about to become_ locally modified
("explicitly hydrated").

All three cases fall under "Return to the state just after a fresh
checkout".  However, as you say, removing implicitly-downloaded
pristines is closer to GC'ing pristines, because both of these cases
dehydrate pristines that had been hydrated by the library's logic,
whereas removal of an explicitly-downloaded pristine undoes an explicit
user action.
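
A minimal sketch of that three-way distinction, in Python. This is one
possible policy ("keep an explicitly hydrated pristine until told to
drop it"), not a description of what the branch does today; the enum
and function names are invented for illustration.

```python
from enum import Enum

class Pristine(Enum):
    ABSENT = "absent"
    IMPLICIT = "implicitly hydrated"   # fetched by the library for a modified file
    EXPLICIT = "explicitly hydrated"   # requested by the user in advance

def after_automatic_cleanup(state, locally_modified):
    """Library-driven dehydration: only drops pristines the library fetched."""
    if state is Pristine.IMPLICIT and not locally_modified:
        return Pristine.ABSENT
    return state

def after_user_dehydrate(state):
    """An explicit user request drops the pristine however it was obtained."""
    return Pristine.ABSENT
```

Under this policy an explicitly hydrated pristine survives automatic
cleanup even when the file is unmodified, which is exactly the case
that needs the library to remember how the pristine got there.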

As to UI…

- The GC case has no downsides unless the user downdates or switches the
  wc, in which case the unreferenced pristines might be used.

  [Remind me: is this reuse possible over all RA layers?]

- In the other two cases, the user is making an informed choice to
  dehydrate.  That is similar to «revert» in that an accidental use may
  have non-trivial costs, so we can consider some of «revert»'s
  behaviour of defaulting to --depth=empty and not using
  svn_opt_push_implicit_dot_target(), particularly if the file in
  question has local mods.

  I don't immediately see a reason to distinguish between explicitly-
  and implicitly-downloaded files at the UI level in this context.
  However, right now the question is whether we should make this
  distinction at the library implementation level.  (For instance, the
  "keep the pristine until told to drop it" scenario implies being able
  to make such a distinction.)

  More precisely, the question is whether our design permits us to add
  this functionality to the library if and when a UI need for it
  arises.

  I think it does.  Suppose we release 1.15 without library support for
  distinguishing implicitly/explicitly-hydrated pristines, and then want
  to add such support in 1.16.  I think 1.16 will be able to implement
  this without a format bump if it adds to the PRISTINE table a column
  declared as «manually_hydrated INTEGER NOT NULL DEFAULT ( 0 )»,
  provided 1.16 will handle the possibility that an old, 1.15 client
  will dehydrate the pristine in spite of the user's instruction.  (The
  DEFAULT constraint on column definitions is supported by the oldest
  SQLite version supported by 1.14.)

  So, unrolling the chain of logic, I think we'll be able to teach the
  backend to distinguish explicitly/implicitly hydrated pristines if and
  when the UI requires this.
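
The column-with-default idea can be tried out with Python's sqlite3
module. This is only a sketch: the table below is a simplified stand-in
for PRISTINE (the real wc.db schema has more columns), the checksums are
dummy values, and the default is spelled as a bare 0 so that ALTER TABLE
accepts it on a populated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the PRISTINE table as a "1.15" client would see it.
conn.execute(
    "CREATE TABLE pristine (checksum TEXT PRIMARY KEY, size INTEGER NOT NULL)"
)
conn.execute("INSERT INTO pristine VALUES ('$sha1$aaaa', 10)")

# The "1.16" step: add the column with a default, so existing rows stay valid.
conn.execute(
    "ALTER TABLE pristine "
    "ADD COLUMN manually_hydrated INTEGER NOT NULL DEFAULT 0"
)

# An insert from an old client that does not know the new column still works.
conn.execute("INSERT INTO pristine (checksum, size) VALUES ('$sha1$bbbb', 20)")

rows = conn.execute(
    "SELECT checksum, manually_hydrated FROM pristine ORDER BY checksum"
).fetchall()
```

Both rows end up with manually_hydrated = 0, i.e. an old client's rows
simply read as "not explicitly hydrated" to a newer one.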

>

Aside: Can we have two working copies share _only_ their pristine
stores?  That is, continue to have separate wc.db files, but use the
same on-disk pristine store?  That might be easier to implement than
shared wc.db's, and would be useful if multiple wc's need the same file
hydrated.

(Or sharing could happen at a lower level, as with
http://scord.sourceforge.net/, mentioned on issue #525 — although this
particular solution doesn't support wc-ng (≥1.7).)

> And maybe a --interactive option would be good, so the user can
> interactively choose which pristines to drop and which not!

Perhaps; but for now, we can let people write scripts around svn to
achieve this, like how svn-bisect(1) and backport.pl are external.
That's one reason for adding a line to `svn info`.

> I'm partly just thinking out loud here, to stimulate us all to think.  None
> of this affects the initial, whole-WC implementation, and of course let's
> keep in mind that the *main* use case will be already well-served by that
> initial implementation.  These further improvements are for the future, and
> perhaps we shouldn't even make them until we've all had some experience with
> the initial simple UI.

+1 to every single sentence of this paragraph.

Also:

> These further improvements are for the future, and perhaps we
> shouldn't even make them until we've all had some experience with the
> initial simple UI.

We _could_ make them in a way that doesn't require us to provide
compatibility for them forever, such as by releasing them as
"experimental" (cf. https://subversion.apache.org/docs/release-notes/1.10#shelving),
by releasing an alpha or a nightly and soliciting feedback for
that, or by prototyping in Python what can be so prototyped.

Cheers,

Daniel

(Aside: "explicitly-hydrated" is a bit of a mouthful.  I considered just
referring to these as "somatic" and "autonomous" pristines…)

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 24 Jan 2022, Daniel Shahaf wrote:
>Sure!  And a script for running «hydrate» automatically could be called
>"submerge". :)
>
>And I guess we'll want `svn info` to grow a "Last watered at:" line.

As long as we alias 'svn mop' for 'svn cleanup', it's fine with me 
:-).

>Agreed, but perhaps have a --offline-only option to let people say
>"Error out if you can't complete the operation without contacting the
>server".  That might be useful for «revert», «diff», etc., as well.

Yep.

>Use-case: to request "fail fast" behaviour rather than commence
>a (known to the user to be) long/expensive network retrieval.

Indeed, other tools implement that option, for that exact purpose. 
I think it's quite reasonable.

>> > - «svn commit --keep-pristines», in case Alice has two logical changes
>> > that she'd like to make in separate commits?
>>
>> Maybe, or maybe one just uses 'svn dehydrate' ('svn hydrate
>> --dehydrate' :-) ) when one is done working on the file.
>
>I figured «svn commit --keep-pristines» could be used not only after
>manually hydrating but also after implicit hydrating (e.g., after
>«echo foo >> iota && svn diff iota»).

Before implementing these options (which obviously won't happen in 
the first iteration anyway), we should think carefully about 
naming and about how much of the underlying implementation detail 
we want to expose.

And there are broader UI/UX questions:

Maybe once someone starts working on a (large) file, they are 
likely to continue working on it.  In which case, we should keep 
the pristine until told to drop it.  Then 'svn cleanup' could have 
an additional behavior: "remove pristine for any unmodified file 
that would *normally* not have a pristine (except that the user 
manually caused it to have a pristine due to some special action 
or circumstance)".

This is a bit different from the current '--vacuum-pristines' 
option of 'svn cleanup', by the way, though maybe there should be 
some connection.

And maybe a --interactive option would be good, so the user can 
interactively choose which pristines to drop and which not!

I'm partly just thinking out loud here, to stimulate us all to 
think.  None of this affects the initial, whole-WC implementation, 
and of course let's keep in mind that the *main* use case will be 
already well-served by that initial implementation.  These further 
improvements are for the future, and perhaps we shouldn't even 
make them until we've all had some experience with the initial 
simple UI.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, Jan 24, 2022 at 10:50:14 -0600:
> On 24 Jan 2022, Daniel Shahaf wrote:
> > Which brings me to a less contrived / more general point: What if the
> > user _knows in advance_ they'll need a pristine?  Shouldn't there be: —
> >
> > - a way to say "I'm about to change a large, diffable file; detranslate
> >   it into the pristine store before I touch it"?  Perhaps even make files
> >   read-only at the OS level (as with svn:needs-lock) so the user doesn't
> >   modify the file accidentally until its pristine has been set aside?
> 
> 'svn hydrate'?  (I can't even tell if I'm joking.)

Sure!  And a script for running «hydrate» automatically could be called
"submerge". :)

And I guess we'll want `svn info` to grow a "Last watered at:" line.

> > - a way to say "I've modified a large, diffable file and I'm about to go
> >   offline; download a pristine for this file now"?
> 
> Same command, I think?
> 
> That is: the goal is to get a local pristine copy.  We already know the
> checksum(s) for the pristine and the clean working file (normally they'll be
> the same, unless there was keyword translation).  If we can detranslate the
> working file to get the pristine, then we do that; next option is to try
> fetching the pristine from the repository.

Agreed, but perhaps have a --offline-only option to let people say
"Error out if you can't complete the operation without contacting the
server".  That might be useful for «revert», «diff», etc., as well.

Use-case: to request "fail fast" behaviour rather than commence
a (known to the user to be) long/expensive network retrieval.
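
The fallback order and the fail-fast switch could be sketched like this
in Python. Nothing here is a Subversion API: hydrate(), OfflineError and
the two callables are stand-ins I made up, and offline_only mirrors the
proposed (not existing) --offline-only option.

```python
class OfflineError(Exception):
    """Raised when the pristine cannot be obtained without the network."""

def hydrate(path, detranslate, fetch, offline_only=False):
    """Return pristine bytes for `path`.

    `detranslate(path)` returns bytes, or None if the working file cannot
    reproduce the pristine (e.g. it is locally modified); `fetch(path)`
    contacts the repository.  Both are caller-supplied stand-ins.
    """
    data = detranslate(path)
    if data is not None:
        return data  # cheapest option: no network needed
    if offline_only:
        # Fail fast instead of starting a long or expensive retrieval.
        raise OfflineError(f"{path}: pristine needs a network fetch")
    return fetch(path)
```

The same wrapper shape would apply to «revert» or «diff»: try what can
be done locally, and only then decide whether the network is allowed.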

> > - «svn commit --keep-pristines», in case Alice has two logical changes
> >   that she'd like to make in separate commits?
> 
> Maybe, or maybe one just uses 'svn dehydrate' ('svn hydrate
> --dehydrate' :-) ) when one is done working on the file.

I figured «svn commit --keep-pristines» could be used not only after
manually hydrating but also after implicit hydrating (e.g., after
«echo foo >> iota && svn diff iota»).

As to explicit hydrations:

- After «svn hydrate», should a revert dehydrate (by default)?

I'm not sure.  Someone might revert a patch they'd started writing in
order to re-do it; and someone might revert a patch they'd started
writing because they're abandoning it.

«revert» means "discard my local changes".  It doesn't mean "throw away
my ability to diff/revert cheaply".  So, I guess revert _shouldn't_
dehydrate an explicitly-hydrated file…?

For that matter, perhaps «svn revert iota» shouldn't dehydrate iota, but
leave iota.svn-base behind to be dehydrated by the _next_ svn operation…?
This could help if the revert is followed by further edits to iota.

- After «svn hydrate», should a commit dehydrate (by default)?

I guess so.

- After «svn hydrate», should an explicit «svn dehydrate» be an error if
  the file is locally modified?

Dehydrating _unmodified_ files should be a rare situation: it will only
be able to happen after «commit --keep-pristines» or «revert --keep-pristines».

Dehydrating a locally-modified file means that:

  + A subsequent revert will hit the network

  + If the file is deltifiable, subsequent commit won't be able to
    deltify against BASE

But then again, perhaps the file isn't deltifiable (whether or not it's
diffable) and the user knows they intend to commit it and not revert it.

I think I'm leaning towards not making this an error.  Unix doesn't try
to protect users from themselves, and besides, we can always make this
an error down the road.

Cheers,

Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 24 Jan 2022, Daniel Shahaf wrote: 
>[...]   To be clear, I'm not trying to pick nits; I'm trying to 
>make sure that we don't make unwarranted assumptions.  We might 
>get a lightbulb moment from that.  (E.g., that's basically how we 
>realized we should deprecate --reintegrate, IIRC.) 

Agreed.  I found Julian's initial analysis very helpful, and still 
think it's overall reasonable & correct -- but now is definitely 
the time to probe our assumptions carefully, so your questions are 
good to ask. 

>Which brings me to a less contrived / more general point: What if
>the user _knows in advance_ they'll need a pristine?  Shouldn't
>there be: —
>
>- a way to say "I'm about to change a large, diffable file;
>  detranslate it into the pristine store before I touch it"?
>  Perhaps even make files read-only at the OS level (as with
>  svn:needs-lock) so the user doesn't modify the file accidentally
>  until its pristine has been set aside?

'svn hydrate'?  (I can't even tell if I'm joking.) 

>- a way to say "I've modified a large, diffable file and I'm about to go
>  offline; download a pristine for this file now"?

Same command, I think?

That is: the goal is to get a local pristine copy.  We already 
know the checksum(s) for the pristine and the clean working file 
(normally they'll be the same, unless there was keyword 
translation).  If we can detranslate the working file to get the 
pristine, then we do that; next option is to try fetching the 
pristine from the repository.

>- «svn commit --keep-pristines», in case Alice has two logical changes
>  that she'd like to make in separate commits?

Maybe, or maybe one just uses 'svn dehydrate' ('svn hydrate 
--dehydrate' :-) ) when one is done working on the file.

>+1 to starting with a per-WC knob.  However, all else being 
>equal, we should try to design this in a way that allows future 
>extensions, including extensions that set pristinefulness on a 
>finer granularity than per-WC.  (That's similar to what I said 
>earlier in this thread in [1]). 

Completely agree.  I assume this is what Julian had in mind all 
along.  Identifying those knobs now is a good idea, though, in 
case they have any design implications.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Julian Foad wrote on Thu, Jan 20, 2022 at 21:03:02 +0000:
> The only case in which a simple per-WC setting might be unsatisfactory
> is the following combination:

Why would it be the only case?

Point by point:

>   - the repository is "slow" (and/or offline working is required);

Agreed.  This can be assumed so there'll be a use-case for
pristinelessness for some files.

>   - and, in a single WC:
>     - the WC data set is "huge" (relative to local disk space) in total; and

Ditto.

>     - there is a subset of files on which the user needs to work
>       (requiring diffs, etc.) often enough that fetching their
>       pristines "on demand" is a problem; and

Disagree.  Why would fetching on-demand being a problem _necessarily_ be
caused by an "often enough" need to work on some files?  Why couldn't
fetching pristines on demand be a problem for files that change once in
a blue moon?

For example, a laptop could easily be sometimes behind a wide downlink
and sometimes behind a narrow one.  If most of the time there's a wide
downlink, the user might enable optional pristines for most files, but
they might nevertheless have some files that are large and diffable and
may need to be edited and diffed while behind a narrow downlink.

>     - that subset of files is not "huge" in total; and

I agree that that subset's pristines are necessarily able to be stored
locally at least from time to time, but no more than that.  It's not
_necessarily_ possible to store those files' pristines permanently, and
the files themselves aren't necessarily small (people diffoscope(1) ISO
files nowadays).

>     - that subset of files can be distinguished from the rest by metadata.

Why is this necessarily the case whenever per-WC granularity wouldn't
suffice?

Actually, this seems to be a statement not about the _use-cases_ in
which per-WC granularity wouldn't suffice, but about the _solution_ that
would address those use-cases.  More specifically, this seems to rule
out solutions that involve hardcoded lists (à la svn:ignore without glob
patterns, svn:keywords, and «svn update --set-depth=exclude»).

>

To be clear, I'm not trying to pick nits; I'm trying to make sure that
we don't make unwarranted assumptions.  We might get a lightbulb moment
from that.  (E.g., that's basically how we realized we should deprecate
--reintegrate, IIRC.)

>

Let me try to sketch a use-case for wanting only _some_ files to be
pristineless.

Suppose that:

- We wanted to distribute libsvn*.so via an svn repository.

- We were informed of a vulnerability and needed to commit updated
  binaries

- The RM had to commit the updated binaries whilst behind a narrow
  uplink

- The RM knew in advance they'd be behind a narrow downlink, but did not
  know in advance [a vulnerability would be reported and therefore] the
  binaries would need to be rebuilt and uploaded

- The RM is usually behind a wide downlink

In this situation, it would be reasonable and even prudent of the RM,
before they leave their wide downlink, to pre-hydrate the bases for 1.10
and 1.14 binaries, only.  This way they're prepared: they'll be able to
rebuild and push 1.10 and 1.14 binaries should a vulnerability come in;
and as to 1.13, well, that's no longer supported so it can wait until
they're back at their wide downlink location.

>

Which brings me to a less contrived / more general point: What if the
user _knows in advance_ they'll need a pristine?  Shouldn't there be: —

- a way to say "I'm about to change a large, diffable file; detranslate
  it into the pristine store before I touch it"?  Perhaps even make
  files read-only at the OS level (as with svn:needs-lock) so the user
  doesn't modify the file accidentally until its pristine has been set
  aside?

- a way to say "I've modified a large, diffable file and I'm about to go
  offline; download a pristine for this file now"?

- «svn commit --keep-pristines», in case Alice has two logical changes
  that she'd like to make in separate commits?

I realize that most of these ideas are more about the "large, diffable
files; network access is slow/expensive" case and less about the
"undiffable undeltifiable never-reverted large file; a 10Gbps connection
to the server" case.

> That is certainly a possible case, but we have no suggestion that it is
> at all common. It is not one of the cases driving this feature. So I
> think it is not something to design for at this stage.
> 
> I'm going to work on getting something more basic (per-WC yes/no) closer
> to production-ready and then we can re-assess it.

+1 to starting with a per-WC knob.  However, all else being equal, we
should try to design this in a way that allows future extensions,
including extensions that set pristinefulness on a finer granularity
than per-WC.  (That's similar to what I said earlier in this thread
in [1]).

Cheers,

Daniel

[1] <https://mail-archives.apache.org/mod_mbox/subversion-dev/202107.mbox/%3C20210730233846.GB23161%40tarpaulin.shahaf.local2%3E>, last hunk

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@foad.me.uk>.

Karl Fogel wrote:
>>> So if we have client-side configuration that can specify "no 
>>> pristine" based on some combination of one or more of...
> [... size, properties, etc. ...]
> with a general mechanism for combining conditions, then things 
> will be in a good position for future improvement.

The more I think about this, the more I think we are prematurely
complicating the requirements in this respect. I'm going to back-track
and posit that a simple per-WC switch should suffice for the vast
majority of cases, and has the benefit of simplicity. (The user might
wish to set this based on the repository location -- local/fast versus remote/slow.)

I will note that I previously misunderstood the current
'pristines-on-demand' implementation as fetching the pristine before a
diff (for example) and discarding it afterwards.  In fact it keeps the
pristine as long as the file in question remains in a locally modified
state, and only discards the pristine when (before or after some client
operation) the file is no longer in a modified state. That is to say, it
fetches pristines less often than I had thought.
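
As I understand that retention rule, it amounts to the following (a
Python sketch written from the description above, not read off the
branch's code; the function name and data shapes are hypothetical, and
it assumes a pristine is fetched whenever an operation finds a modified
file without one).

```python
def sync_pristines(files, have_pristine):
    """Decide which pristines to fetch and which to discard.

    `files` maps path -> True if the file is locally modified;
    `have_pristine` is the set of paths whose pristine is on disk.
    Returns (to_fetch, to_discard) as sorted lists.
    """
    to_fetch = sorted(
        p for p, modified in files.items()
        if modified and p not in have_pristine
    )
    to_discard = sorted(
        p for p, modified in files.items()
        if not modified and p in have_pristine
    )
    return to_fetch, to_discard
```

A modified file that already has its pristine appears in neither list,
which is why repeated diffs of the same file do not hit the network
again.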

The only case in which a simple per-WC setting might be unsatisfactory
is the following combination:

  - the repository is "slow" (and/or offline working is required);

  - and, in a single WC:
    - the WC data set is "huge" (relative to local disk space) in total; and
    - there is a subset of files on which the user needs to work
      (requiring diffs, etc.) often enough that fetching their
      pristines "on demand" is a problem; and
    - that subset of files is not "huge" in total; and
    - that subset of files can be distinguished from the rest by metadata.

That is certainly a possible case, but we have no suggestion that it is
at all common. It is not one of the cases driving this feature. So I
think it is not something to design for at this stage.

I'm going to work on getting something more basic (per-WC yes/no) closer
to production-ready and then we can re-assess it.

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 19 Jan 2022, Vincent Lefevre wrote:
>On 2022-01-13 22:38:34 -0600, Karl Fogel wrote:
>> So if we have client-side configuration that can specify "no pristine"
>> based on some combination of one or more of...
>> 
>>  - file size
>>  - repository of origin
>>  - path and/or basename
>>  - svn:mime-type property (if present)
>>  - some custom property
>
>I would add: the age of the last change of a file in the repository.
>
>My personal repository has many files (currently 26937), but most of
>them are no longer changed.

Interesting idea!  I hadn't thought of that at all, and it makes 
sense.

Of course, there may not be time to implement all of these UI 
routes right away.  But if we get the most important ones, along 
with a general mechanism for combining conditions, then things 
will be in a good position for future improvement.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2022-01-13 22:38:34 -0600, Karl Fogel wrote:
> So if we have client-side configuration that can specify "no pristine" based
> on some combination of one or more of...
> 
>  - file size
>  - repository of origin
>  - path and/or basename
>  - svn:mime-type property (if present)
>  - some custom property

I would add: the age of the last change of a file in the repository.

My personal repository has many files (currently 26937), but most of
them are no longer changed.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 12 Jan 2022, Julian Foad wrote:
>> No reason to upgrade an old WC until someone actually wants an 
>> optional pristine.
>
>In principle, and what we ideally desire, agreed.  Here I was just saying
>what this branch does as it is now, before being combined with the
>multi-wc-format work, which we're told is needed to accommodate what we
>desire.  (I'll be looking into exactly what this means and whether
>avoiding WC database changes and using on-disk pristine presence alone
>is a feasible (perhaps even superior) alternative, as I mentioned.)

Gotcha -- understood.

>> Also, the point of this feature is not to remove pristines for all
>> unmodified files.  It's to make it possible for users to specify
>> certain circumstances (generally involving large file size!) in
>> which the pristine should be omitted *for certain files*.
>
>Understood.  That ability (to pick and choose which files it applies to,
>by some client-side config) will need to be added.  I'll be looking into it.

Cool.

Here's an example use case, in case it helps:

At our company, we have a separate repository for large binary 
assets, which we call our "bigdata" repository.  However, not 
everything in that repository is a giant multi-gigabyte blob -- 
there are also some README files, etc.  For those small files, one 
naturally wants the pristines, because 'svn diff' is useful 
locally on them.  But generally no one wants pristines for the 
large binary blob files.

So if we have client-side configuration that can specify "no 
pristine" based on some combination of one or more of...

  - file size
  - repository of origin
  - path and/or basename
  - svn:mime-type property (if present)
  - some custom property

...then any developer will be able to get whatever behavior they 
need given their local storage constraints.

Different people may make different choices based on available 
local storage.  A developer with a lot of local disk space might 
set her no-pristine size threshold to, say, 5 GB, and thus at 
least preserve 'svn revert' ability for those files (I don't think 
'svn diff' would be useful for any of the files, though there 
might be exceptions even to that).  Meanwhile, another developer 
with less disk space might choose 100 MB as the limit.

This is why I feel so strongly that the UI needs to be entirely 
client side -- only the client side has the information needed to 
make the appropriate decision.
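
To make the criteria listed above concrete, here is a minimal Python sketch of how a client-side "no pristine" rule set might combine them. Nothing here exists in Subversion: the names FileInfo, NoPristinePolicy and omit_pristine, and the example URL, are invented for illustration, and the "all configured criteria must match" semantics is just one possible choice.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional, Tuple


@dataclass
class FileInfo:
    """Hypothetical per-file metadata that a client could inspect."""
    path: str
    size: int                        # size in bytes
    repos_root_url: str              # repository of origin
    mime_type: Optional[str] = None  # svn:mime-type, if present


@dataclass
class NoPristinePolicy:
    """Hypothetical client-side rule set; unset criteria are ignored."""
    min_size: Optional[int] = None
    repos_root_urls: Tuple[str, ...] = ()
    path_globs: Tuple[str, ...] = ()
    mime_prefixes: Tuple[str, ...] = ()

    def omit_pristine(self, info: FileInfo) -> bool:
        """Return True only if every configured criterion matches."""
        # With nothing configured the feature stays off (opt-in).
        if self.min_size is None and not (
                self.repos_root_urls or self.path_globs or self.mime_prefixes):
            return False
        if self.min_size is not None and info.size < self.min_size:
            return False
        if self.repos_root_urls and info.repos_root_url not in self.repos_root_urls:
            return False
        if self.path_globs and not any(
                fnmatch(info.path, glob) for glob in self.path_globs):
            return False
        if self.mime_prefixes and not any(
                (info.mime_type or "").startswith(p) for p in self.mime_prefixes):
            return False
        return True


MB = 1024 * 1024
BIGDATA = "https://svn.example.com/bigdata"  # made-up URL

# A developer with limited disk space: omit pristines of 100 MB or more,
# but only for files that come from the bigdata repository.
policy = NoPristinePolicy(min_size=100 * MB, repos_root_urls=(BIGDATA,))
readme = FileInfo("assets/README", 2048, BIGDATA)
blob = FileInfo("assets/scan.tiff", 3 * 1024 * MB, BIGDATA, "image/tiff")
```

With this policy the small README keeps its pristine (so 'svn diff' still works offline on it) while the multi-gigabyte blob does not. A developer with more disk space would simply raise min_size.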

By the way, I'll give some more details about our setup, since it 
involves a nice trick: our bigdata repository tree is a sparse 
mirror of our regular internal corporate repository tree, which is 
also in SVN because Subversion's path-based authz is so great when 
you have different clients / contractors / employees / partners 
all having access to different things.  (Note that our source code 
is in public Git repositories -- it's all open source.  The stuff 
I'm talking about here is not source code, though that doesn't 
matter for this discussion.)

Having the two SVN repositories be parallel means that we can use 
the same authn file and authz file for both :-).  So if a person 
has access to customer Foo's area in the regular repository, then 
by default they also have access to the bigdata assets for Foo as 
well.  (In the few cases where different access is needed, we just 
create an extra subdirectory and update the authz spec 
accordingly.)

Most people never need the binary assets, and so they don't pay 
those checkout or storage costs -- they never fetch from bigdata. 
But a few people *do* need access to the bigdata, and for some of 
it the checkout totals can run to hundreds of gigabytes, so 
avoiding pristines is not just a nice benefit but rather usually a 
necessity.

All this is just our use case, of course.  I don't mean that it is 
more
important than other use cases others may present.  I just wanted 
to
give some concreteness to the discussions.  I hope others will 
post with
their scenarios.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Karl Fogel wrote:
>>    * pristines are missing until needed (for diff, commit, 
>>    [...]
> 
> Hmmm, why would pristines be needed for commit?

Oh you're right, I forgot, we've already been through this: originally
it fetched them, people discussed, then Evgeny posted a tweak to make it
send full texts instead.

>> FWIW, the interop behaviour of current 'pristines-on-demand' 
>> branch by itself is:
>> 
>>    * new svn errors on an old WC; recommends 'svn upgrade'
>>    * new svn 'svn upgrade' quickly upgrades the WC in place, 
>>    removing
>> pristines of all unmodified files
>>    * old svn errors on a new WC
> 
> No reason to upgrade an old WC until someone actually wants an 
> optional pristine.

In principle, and what we ideally desire, agreed.  Here I was just saying
what this branch does as it is now, before being combined with the
multi-wc-format work, which we're told is needed to accommodate what we
desire.  (I'll be looking into exactly what this means and whether
avoiding WC database changes and using on-disk pristine presence alone
is a feasible (perhaps even superior) alternative, as I mentioned.)

> Also, the point of this feature is not to remove pristines for all 
> unmodified files.  It's to make it possible for users to specify 
> certain circumstances (generally involving large file size!) in 
> which the pristine should be omitted *for certain files*.

Understood.  That ability (to pick and choose which files it applies to,
by some client-side config) will need to be added.  I'll be looking into it.

> I expect an old svn to error on a new WC *when that new WC 
> actually has some already-omitted pristines*.  Other than that 
> circumstance, there's no reason an old SVN shouldn't work -- it'll 
> just not implement the "optional pristines for certain files" 
> feature, and the working copy will be larger than it otherwise 
> might have been.  If it's important to the user to upgrade their 
> svn to get some optional pristines, then the user can do so.

Understood, and exploring to what extent that may be possible.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Evgeny Kotkov <ev...@visualsvn.com>.

Julian Foad <ju...@apache.org> writes:

> Hello everyone. Thanks to sponsorship arranged by Karl, I'm able to work
> on completing this.

Hi Julian,

It's great to have you on board!

> Important Question:
>
>   * Is the approach with a WC format bump the best approach?
>
> I see two options to consider:
>
>   * with format bump like this, plus completing the multi-WC-format
> branch work
>   * re-work the implementation of the concept, to not change the
> declared format number or DB schema, but instead to query "is the file
> present on disk" instead of the DB 'hydrated' flag
>
> For each of those options,
>
>   * what is the outcome in terms of interoperability?
>   * how much work is involved? (or is the 2nd one impossible?)

Between these two options, I don't think that it would be actually possible to
have the pristines-on-demand functionality available without a format bump,
as the existing clients working with the current format rely on the pristine
contents always being available on disk.

Personally, I tend to think that making a format bump and accompanying it with
the multi-wc-format support should work pretty well from a user's standpoint.

> Right now I'm reviewing this work and the long discussion about it and
> I will come back with a summary and plan for your consideration, and no
> doubt many questions, in the next few days.

Perhaps, one way to start off would be to fix the temporary assumption that
any working copy of the newer format always fetches the pristines on-demand.

While the final way of configuring that undoubtedly is an important question to
answer, there might be a way to make progress while keeping the configuration
simple, at least for now:

— Make the "are pristines being fetched on-demand" a property of the
  working copy
— Introduce a new configuration option that controls this new behavior
  and is applied at the moment when a working copy is created

I think that carrying these steps would bring the feature closer to its final
shape, and would also mean that a lot of the important subsurface work (such
as preserving the old behavior for existing working copies, running the tests
for both configurations, etc.) is already in place, so that we wouldn't have
to think about it further on.
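
As a rough illustration of the two steps above, here is a Python sketch in which the on-demand behaviour is recorded once, at working copy creation time, and consulted afterwards. The option name 'pristines-on-demand', the class and the function names are all hypothetical; the real storage location (WC database, format variant, etc.) is exactly what is still under discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkingCopySettings:
    """Hypothetical settings stored in the working copy when it is created."""
    pristines_on_demand: bool


def create_working_copy_settings(client_config: dict) -> WorkingCopySettings:
    """Read the (invented) config option once, at checkout time."""
    value = str(client_config.get("pristines-on-demand", "no")).strip().lower()
    return WorkingCopySettings(pristines_on_demand=value in ("yes", "true", "on", "1"))


def needs_fetch(wc: WorkingCopySettings, pristine_on_disk: bool) -> bool:
    """Only an on-demand working copy may ever need to fetch a pristine."""
    return wc.pristines_on_demand and not pristine_on_disk


# A working copy created with the default config keeps the old behaviour,
# even if the user changes the config option later.
old_wc = create_working_copy_settings({})
new_wc = create_working_copy_settings({"pristines-on-demand": "yes"})
```

Because the setting is frozen into the working copy, both configurations can coexist on one machine, which is also what would let the test suite run against each of them.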

Thanks,
Evgeny Kotkov

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Wed, Jan 12, 2022 at 4:07 PM Julian Foad <ju...@apache.org> wrote:
>
> Nathan Hartman wrote:
> > It sounds like the feature is planned to be opt-in [...]
> > but do we want to consider opt-out [...or...] e.g., "smart" [...]?
>
> Opt-in sounds most sensible to me.
>
> Rationale:  A default behaviour change that would cause problems for
> those who expect certain operations to work offline would be Bad.  For
> existing users for whom the feature would be useful, it is probably
> worth their effort to make a config setting to opt in.  For users new to
> Subversion, a default to some kind of "smart" choice, whether as trivial
> as a fixed size threshold or as complex as calculated taking into
> account network throughput -- that would be nice -- defaults are
> powerful -- but it would not be wise to do this at the expense of
> breaking some existing users' expectations.  We can't tell
> programmatically whether a user has "new user" or "existing user"
> expectations -- the creation of "~/.subversion" at install time, for
> example, doesn't tell us that.

This sounds quite sensible to me, especially for the command line
client. I suppose 3rd party tools, e.g., GUIs, could implement some
"smart" behaviors as long as APIs exist that make it possible, but I
agree we should not mess with user expectations of being able to work
offline by default.

(It's a Good Thing that this is stated explicitly.)

More inline, below...

> > Once a pristine is fetched, do we want to delete it automatically or wait
> > until the user runs a cleanup operation?
>
> As the rationale for the feature is that network fetches are cheap
> relative to storage cost, and favouring simplicity over complexity, it
> seems clear to me that Evgeny took the correct approach, maintaining an
> invariant that pristines only remain present while needed for the
> current operation.

Simplicity is a good argument. I'm good with that. More below...

> > Also there is the question of how the option is activated: per-file, per-wc,
> > per-user, per-system? [...]
> >
> > Answer(s) to the above questions may have a pretty significant effect
> > on the
> > scope and design of the feature, both immediate and possible future
> > directions, so I think it's important to think about this carefully.
>
> I propose the following for the low-level libraries and clients:
>
>     * the WC level decision making be per file, making a call-back to
> the client/app level, passing the file's basic metadata (size, details
> from 'svn info') and subversion properties;
>     * (to reduce callback overhead allow querying multiple files per call);
>     * we implement the callback in 'svn' command-line based on a simple
> setting that the user can put in the user config file
> (~/.subversion/config on *nix)

+1 to the above points.

>     * we choose, now, the simple config setting shall be just a size
> threshold alone (likely good enough);

Will it complicate things too much to provide the possibility of a
particular wc being no-pristines? E.g., a user could supply some switch
to 'svn checkout'. If this is possible, it handles the 2nd example use
case I mentioned, where a project may contain a very large number of
files, not necessarily large binaries, and the user desires to exclude
pristines for all of them. Perhaps the network is very fast, or repo
and wc are on the same machine, and/or wc is on a ramdrive... This
could be quite beneficial in a number of different ways and opens up
some interesting future directions like repo-in-wc.

>     * consider possibly allowing it to factor in other
> metadata/properties, but that quickly gets complex and I can't imagine
> much gain in most cases;
>     * we make libsvn_client use by default that same callback
> implementation so third-party client implementations work the same as
> ours, by default;
>     * a different client (such as a GUI client) may supply its own
> callback to libsvn_client to get different control, including for
> example the possibility of per-wc differences (by looking at the
> svn_info.wc_root metadata) or controlled by a subversion property on the
> file or on the wc root, and so on.

+1.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

Nathan Hartman wrote:
> It sounds like the feature is planned to be opt-in [...]
> but do we want to consider opt-out [...or...] e.g., "smart" [...]?

Opt-in sounds most sensible to me.

Rationale:  A default behaviour change that would cause problems for
those who expect certain operations to work offline would be Bad.  For
existing users for whom the feature would be useful, it is probably
worth their effort to make a config setting to opt in.  For users new to
Subversion, a default to some kind of "smart" choice, whether as trivial
as a fixed size threshold or as complex as calculated taking into
account network throughput -- that would be nice -- defaults are
powerful -- but it would not be wise to do this at the expense of
breaking some existing users' expectations.  We can't tell
programmatically whether a user has "new user" or "existing user"
expectations -- the creation of "~/.subversion" at install time, for
example, doesn't tell us that.

> Once a pristine is fetched, do we want to delete it automatically or wait
> until the user runs a cleanup operation?

As the rationale for the feature is that network fetches are cheap
relative to storage cost, and favouring simplicity over complexity, it
seems clear to me that Evgeny took the correct approach, maintaining an
invariant that pristines only remain present while needed for the
current operation.

> Also there is the question of how the option is activated: per-file, per-wc,
> per-user, per-system? [...]
> 
> Answer(s) to the above questions may have a pretty significant effect
> on the
> scope and design of the feature, both immediate and possible future
> directions, so I think it's important to think about this carefully.

I propose the following for the low-level libraries and clients:

    * the WC level decision making be per file, making a call-back to
the client/app level, passing the file's basic metadata (size, details
from 'svn info') and subversion properties;
    * (to reduce callback overhead allow querying multiple files per call);
    * we implement the callback in 'svn' command-line based on a simple
setting that the user can put in the user config file
(~/.subversion/config on *nix)
    * we choose, now, the simple config setting shall be just a size
threshold alone (likely good enough);
    * consider possibly allowing it to factor in other
metadata/properties, but that quickly gets complex and I can't imagine
much gain in most cases;
    * we make libsvn_client use by default that same callback
implementation so third-party client implementations work the same as
ours, by default;
    * a different client (such as a GUI client) may supply its own
callback to libsvn_client to get different control, including for
example the possibility of per-wc differences (by looking at the
svn_info.wc_root metadata) or controlled by a subversion property on the
file or on the wc root, and so on.
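
To illustrate the shape of that proposal, here is a small Python sketch of a batched per-file callback with a size-threshold default and a client-supplied override. It is not a proposed API: FileMeta, KeepPristineCallback and both factory functions are invented names, and the real interface would be C in libsvn_wc / libsvn_client.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Set


@dataclass
class FileMeta:
    """Hypothetical metadata passed up from the WC layer for one file."""
    path: str
    size: int
    wc_root: str
    props: Dict[str, str] = field(default_factory=dict)


# One call decides for a whole batch of files (True = keep the pristine),
# which keeps the per-file callback overhead low.
KeepPristineCallback = Callable[[Sequence[FileMeta]], List[bool]]


def size_threshold_callback(threshold_bytes: int) -> KeepPristineCallback:
    """Default behaviour: keep pristines only for files below the threshold."""
    def decide(batch: Sequence[FileMeta]) -> List[bool]:
        return [meta.size < threshold_bytes for meta in batch]
    return decide


def per_wc_callback(no_pristine_roots: Set[str],
                    fallback: KeepPristineCallback) -> KeepPristineCallback:
    """What a GUI client might supply: whole working copies without pristines."""
    def decide(batch: Sequence[FileMeta]) -> List[bool]:
        defaults = fallback(batch)
        return [False if meta.wc_root in no_pristine_roots else keep
                for meta, keep in zip(batch, defaults)]
    return decide


default_cb = size_threshold_callback(100)
gui_cb = per_wc_callback({"/ram/wc"}, default_cb)
batch = [FileMeta("a.txt", 10, "/home/me/wc"),
         FileMeta("big.bin", 500, "/home/me/wc"),
         FileMeta("b.txt", 10, "/ram/wc")]
```

Here default_cb(batch) keeps pristines for the two small files only, while gui_cb(batch) additionally drops the pristine for everything under /ram/wc.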

How does all that sound?

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Wed, Jan 12, 2022 at 1:40 PM Karl Fogel <kf...@red-bean.com> wrote:
>
> On 12 Jan 2022, Julian Foad wrote:
> >It works as advertised in simple usage:
> >
> >    * pristines are missing until needed (for diff, commit,
> >    revert,
> >resolve, etc.),
> >    * corresponding disk space reduction
> >    * (and speed differences such as faster checkout, slower
> >    diff)
> >    * fetches pristines when needed, deletes them afterwards
>
> Hmmm, why would pristines be needed for commit?
>
> >It also passes the (branch) test suite, although I note on the
> >branch
> >there are modifications to some tests and some are skipped; I
> >haven't
> >yet reviewed what and why.
> >
> >In skimming through the branch diff I noted:
> >
> >    * extra 'hydrated' flag column added in the WC DB 'pristine'
> >    table,
>
> Very nice choice of adjective -- kudos to Evgeny :-).
>
> >FWIW, the interop behaviour of current 'pristines-on-demand'
> >branch by
> >itself is:
> >
> >    * new svn errors on an old WC; recommends 'svn upgrade'
> >    * new svn 'svn upgrade' quickly upgrades the WC in place,
> >    removing
> >pristines of all unmodified files
> >    * old svn errors on a new WC
>
> No reason to upgrade an old WC until someone actually wants an
> optional pristine.
>
> Also, the point of this feature is not to remove pristines for all
> unmodified files.  It's to make it possible for users to specify
> certain circumstances (generally involving large file size!) in
> which the pristine should be omitted *for certain files*.
>
> I expect an old svn to error on a new WC *when that new WC
> actually has some already-omitted pristines*.  Other than that
> circumstance, there's no reason an old SVN shouldn't work -- it'll
> just not implement the "optional pristines for certain files"
> feature, and the working copy will be larger than it otherwise
> might have been.  If it's important to the user to upgrade their
> svn to get some optional pristines, then the user can do so.
>
> >    * Does it fetch more pristines from the server than are
> >    needed by
> >the operation in progress, in some cases?
>
> Like maybe commit?  :-)
>
> Best regards,
> -Karl


I was about to click "send" and then Karl replied, so I'll update my reply
accordingly... ok done :-)

It sounds like the feature is planned to be opt-in (working copies are same as
before unless the feature is activated), but do we want to consider opt-out
(new working copies are smaller by default unless users request locally cached
pristines) or some other option such as, e.g., "smart" (applies automatically
to files over some size, method to set this threshold TBD)?

Once a pristine is fetched, do we want to delete it automatically or wait
until the user runs a cleanup operation?

Also there is the question of how the option is activated: per-file, per-wc,
per-user, per-system? I could envisage at least one example use case for each
of these possibilities:

* per-file: A repository contains one unusually large file that does not
change often; the admin sets it to no-pristine to save everyone the headache

* per-wc: A developer with a fast connection to the repository checks out a
large codebase (hundreds of MiB of sources) onto a RAMDISK for faster
compile/debug cycles with reduced wear on FLASH memories; network overhead for
diffing, etc., is minimal

* per-user / per-system: The user's repo and wc are on the same machine and
there is never a need for pristines.

Answer(s) to the above questions may have a pretty significant effect on the
scope and design of the feature, both immediate and possible future
directions, so I think it's important to think about this carefully.

Cheers,
Nathan

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 12 Jan 2022, Julian Foad wrote:
>It works as advertised in simple usage:
>
>    * pristines are missing until needed (for diff, commit, 
>    revert,
>resolve, etc.),
>    * corresponding disk space reduction
>    * (and speed differences such as faster checkout, slower 
>    diff)
>    * fetches pristines when needed, deletes them afterwards

Hmmm, why would pristines be needed for commit?

>It also passes the (branch) test suite, although I note on the 
>branch
>there are modifications to some tests and some are skipped; I 
>haven't
>yet reviewed what and why.
>
>In skimming through the branch diff I noted:
>
>    * extra 'hydrated' flag column added in the WC DB 'pristine' 
>    table,

Very nice choice of adjective -- kudos to Evgeny :-).

>FWIW, the interop behaviour of current 'pristines-on-demand' 
>branch by
>itself is:
>
>    * new svn errors on an old WC; recommends 'svn upgrade'
>    * new svn 'svn upgrade' quickly upgrades the WC in place, 
>    removing
>pristines of all unmodified files
>    * old svn errors on a new WC

No reason to upgrade an old WC until someone actually wants an 
optional pristine.

Also, the point of this feature is not to remove pristines for all 
unmodified files.  It's to make it possible for users to specify 
certain circumstances (generally involving large file size!) in 
which the pristine should be omitted *for certain files*.

I expect an old svn to error on a new WC *when that new WC 
actually has some already-omitted pristines*.  Other than that 
circumstance, there's no reason an old SVN shouldn't work -- it'll 
just not implement the "optional pristines for certain files" 
feature, and the working copy will be larger than it otherwise 
might have been.  If it's important to the user to upgrade their 
svn to get some optional pristines, then the user can do so.

>    * Does it fetch more pristines from the server than are 
>    needed by
>the operation in progress, in some cases?

Like maybe commit?  :-)

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Julian Foad <ju...@apache.org>.

OK everyone, here are lots of observations and questions to start off.

I have built and "kicked the tyres of" the current 'pristines-on-demand'
branch, and have skimmed through its changes.

It works as advertised in simple usage:

    * pristines are missing until needed (for diff, commit, revert,
resolve, etc.),
    * corresponding disk space reduction
    * (and speed differences such as faster checkout, slower diff)
    * fetches pristines when needed, deletes them afterwards

It also passes the (branch) test suite, although I note on the branch
there are modifications to some tests and some are skipped; I haven't
yet reviewed what and why.

In skimming through the branch diff I noted:

    * extra 'hydrated' flag column added in the WC DB 'pristine' table,
    * corresponding format bump and upgrade code,
    * corresponding modifications to WC-layer pristines handling,
    * 'textbase_sync' calls added before and after relevant client-layer
operations, to hydrate and dehydrate any relevant files, seems to be the
main substance of the high level logic.
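
As I understand it, the pattern is "hydrate before, dehydrate after". A toy Python sketch of that pattern follows; it is not the branch's code, and PristineStore and synced_textbases are invented names, with an in-memory dict standing in for the on-disk pristine store.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterable, Iterator


class PristineStore:
    """Toy in-memory stand-in for the on-disk pristine store."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch                 # fetches a text from the server
        self._texts: Dict[str, bytes] = {}  # checksum -> pristine contents

    def is_hydrated(self, checksum: str) -> bool:
        return checksum in self._texts

    def hydrate(self, checksum: str) -> None:
        if checksum not in self._texts:
            self._texts[checksum] = self._fetch(checksum)

    def dehydrate(self, checksum: str) -> None:
        self._texts.pop(checksum, None)

    def read(self, checksum: str) -> bytes:
        return self._texts[checksum]


@contextmanager
def synced_textbases(store: PristineStore,
                     checksums: Iterable[str]) -> Iterator[PristineStore]:
    """Fetch the missing pristines, run the operation, then remove them again."""
    fetched = [c for c in checksums if not store.is_hydrated(c)]
    for checksum in fetched:
        store.hydrate(checksum)
    try:
        yield store
    finally:
        # Remove only what this operation fetched, even if it failed.
        for checksum in fetched:
            store.dehydrate(checksum)
```

A client-layer operation such as diff would then run inside "with synced_textbases(store, needed_checksums):", so the invariant that pristines are present only while an operation needs them holds on error paths too.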

I would very much appreciate anyone able to review the WC code changes
to any extent.  I feel the depth of review I can offer is limited in
this area.

Important Question:

    * Is the approach with a WC format bump the best approach?

I see two options to consider:

    * with format bump like this, plus completing the multi-WC-format
branch work
    * re-work the implementation of the concept, to not change the
declared format number or DB schema, but instead to query "is the file
present on disk" instead of the DB 'hydrated' flag

For each of those options,

    * what is the outcome in terms of interoperability?
    * how much work is involved? (or is the 2nd one impossible?)

How would these compare as overall solutions from a user perspective?

Some notes:

    * Reworking Evgeny's code would obviously require considerable effort.

    * Using a new field in the database of course is the best way to
store that data measured by cost per lookup. However that by itself
doesn't guarantee the best overall end result.

    * Getting the multi-WC-format work done now also buys us future
WC-format compatibility scenarios as a bonus.

    * If the multi-WC-format branch is incorporated, I'm not yet clear
what interop story that gives us. I'll try to learn that next.

FWIW, the interop behaviour of current 'pristines-on-demand' branch by
itself is:

    * new svn errors on an old WC; recommends 'svn upgrade'
    * new svn 'svn upgrade' quickly upgrades the WC in place, removing
pristines of all unmodified files
    * old svn errors on a new WC

The user experience of the current 'pristines-on-demand' branch seems
pretty good for basic scenarios. Questions I haven't answered yet:

    * Does it fetch more pristines from the server than are needed by
the operation in progress, in some cases?

While I'm working on answering all these questions, I would very much
appreciate any advice and insight anyone can offer.

- Julian

Re: A two-part vision for Subversion and large binary objects.

Posted by Karl Fogel <kf...@red-bean.com>.

On 11 Jan 2022, Julian Foad wrote: 
>Hello everyone. Thanks to sponsorship arranged by Karl, I'm able 
>to work on completing this. 
 
Yay!  Very glad you're on board for this, Julian.   I should say 
that the sponsorship comes from a consortium of several companies 
-- Open Tech Strategies LLC (my company) is just one of them. 
I'll check with the others to see if they want to be named. 
 
>It's fantastic to see that Evgeny has made a working prototype 
>and that many of you already followed with constructive 
>suggestions.   Right now I'm reviewing this work and the long 
>discussion about it and I will come back with a summary and plan 
>for your consideration, and no doubt many questions, in the next 
>few days. 
 
Looking forward to that, and many thanks to Evgeny for the work 
he's already done on issue #525.

Best regards,
-Karl

Re: A two-part vision for Subversion and large binary objects.

Posted by Daniel Sahlberg <da...@gmail.com>.

On Tue, 11 Jan 2022 at 15:34, Nathan Hartman <ha...@gmail.com> wrote:

> On Tue, Jan 11, 2022 at 9:26 AM Julian Foad <ju...@apache.org> wrote:
> >
> > Hello everyone. Thanks to sponsorship arranged by Karl, I'm able to work
> on completing this.
> >
> > It's fantastic to see that Evgeny has made a working prototype and that
> many of you already followed with constructive suggestions.
> >
> > Right now I'm reviewing this work and the long discussion about it and I
> will come back with a summary and plan for your consideration, and no doubt
> many questions, in the next few days.
> >
> > --
> > - Julian
>
> Yay!! That's great news. Thanks to Karl, Evgeny, Julian, the
> sponsor(s), and everyone involved in making this possible. I'm looking
> forward to this.
>

I couldn't agree more! Great news!

/Daniel

Re: A two-part vision for Subversion and large binary objects.

Posted by Nathan Hartman <ha...@gmail.com>.

On Tue, Jan 11, 2022 at 9:26 AM Julian Foad <ju...@apache.org> wrote:
>
> Hello everyone. Thanks to sponsorship arranged by Karl, I'm able to work on completing this.
>
> It's fantastic to see that Evgeny has made a working prototype and that many of you already followed with constructive suggestions.
>
> Right now I'm reviewing this work and the long discussion about it and I will come back with a summary and plan for your consideration, and no doubt many questions, in the next few days.
>
> --
> - Julian

Yay!! That's great news. Thanks to Karl, Evgeny, Julian, the
sponsor(s), and everyone involved in making this possible. I'm looking
forward to this.

Cheers,
Nathan