You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Karl Fogel <kf...@red-bean.com> on 2022/11/05 22:12:50 UTC

Getting to first release of pristines-on-demand feature (#525).

Hi, all.  This is a high-level mail in which I try to figure out 
the current status of the issue #525 work and what's left to land 
it in trunk and release it.  Corrections and feedback welcome.

To remind everyone:

The purpose of this work is to reduce checkout sizes by optionally 
not having local pristine text-bases for WC files.  In trees that 
have lots of large binary files, this can reduce disk usage by 
about half, so it really matters for some use cases.  Also: in the 
long run, we want the user to be able to specify which files do 
and don't have pristines (but in the first release it can be a 
per-WC choice).

Current status as I understand it:

First, Julian has written up a great description of how the 
feature works from a user's perspective:

https://svn.apache.org/viewvc/subversion/branches/pristines-on-demand-on-mwf/notes/i525/i525-user-guide.md?view=log

Based on that document, it looks to me like we still need some 
well-named knobs by which the user can control this feature. 
Right now, the command-line way looks like one of these:

  $ svn checkout --compatible-version=1.15
  $ svn upgrade --compatible-version=1.15

However, there's a "TODO" note that addresses this UI point:

  > [TODO] We might change this so that upgrading to 
  > 1.15-compatible
  > format and enabling "i525pod" are separate steps and the 
  > latter is
  > optional.

I think we should implement that TODO before releasing the 
feature.  Ideally, the new WC format would support the 
"pristines-on-demand" feature without forcing a given WC to be in 
p-o-d mode.

Right now, if I understand correctly, a WC can either be entirely 
in p-o-d mode or entirely in regular mode (i.e., the current 
default, with pristines are always present for everything).  In 
other words, in its first release, this feature would *not* allow 
users specify that certain files in a WC should be p-o-d while 
other files are regular (but see the note "Now, a subtle point..." 
below about this).  It's a whole-WC thing.

However, I think it's okay to release this feature that way, 
without support for selective per-file p-o-d, as long as the UI 
for per-WC toggling is clear (e.g., not a flag like 
"--compatible-version=1.15", which doesn't say anything about the 
actual behavior being toggled).

("Toggle" may be the wrong word here, as I believe we also don't 
yet have a way to bring a WC back from p-o-d to regular mode.  Do 
we care about that for release?)

Now, a subtle point about this UI issue:

In the bright future, when we *do* support per-file specification 
of p-o-d-ness, there would be no need for a per-WC flag at all. 
Instead, users would specify that certain files should be p-o-d 
either by using client-side configuration options (e.g., all files 
larger than a given size, or having certain MIME type(s), are in 
p-o-d mode), or via command line actions to support explicit 
"hydrate" and "dehydrate" operations (these actions would either 
be top-level subcommands or options to existing commands -- we 
don't need to decide that detail now).

I guess what I'm saying is, if we are *close* to having the 
underlying WC support needed to support per-file selection of 
p-o-d-ness, then maybe it's better to go all the way and just 
finish that.  *Then* people could simply upgrade their working 
copies as usual, with no immediate behavior change resulting from 
that upgrade, and this new feature would then be available to 
them.  We would then offer...

  $ svn checkout --store-pristines=no
  $ svn upgrade --store-pristines=no

...as the gateways to the feature in the first release (so 
p-o-d-ness would to every file in the WC), and add selective UI in 
later releases, knowing that the underlying UI already supports 
it.  However, if that's a complex change in the WC code, then 
let's just release with whole-WC support and not delay.

Have I summarized the current status accurately?  Thoughts?

Please see also Julian's status email from April, which goes into 
more detail about which tests need updating, etc:

  https://lists.apache.org/thread/lm98og8jqonffcs250q5y3ft5r5qlmk5

  From: Julian Foad
  To: Daniel Shahaf
  Cc: Subversion Dev, Karl Fogel
  Subject: Re: A two-part vision for Subversion and large binary 
  objects.
  Date: Tue, 5 Apr 2022 15:50:56 +0100
  Message-ID: 
  <70...@getmailspring.com>

By the way, in that thread, Evgeny Kotkov -- whose initial work 
much of this is based on -- follows up with a patch that does a 
first-pass implementation of 'svn checkout --store-pristines=no' 
(by implementing a new persistent setting in wc.db).

Note that Julian and Daniel originally undertook this work as part 
of a contract with my company (which represents a consortium of 
companies interested in this feature).  Mostly it was Julian 
writing new code and Daniel reviewing and writing tests, and I 
thank both of them for having gotten us this far.

The work went a bit over budget not through any fault of theirs, 
but because we ran into an unexpected snag having to do with order 
of network operations in Subversion.  TL;DR: even though in 
*theory* an operation can always know at the beginning which 
pristines it has locally and which ones it doesn't, Subversion's 
current client/server communications conventions don't take 
advantage of that information in the way we'd want.  Instead, the 
client assumes pristines are present and sends up-front revision 
information to the server, causing the server to send responses 
that rely on those pristines being present.  The whole way the 
client and server talk to each other is based on this; it's 
fixable, of course, but doing so is not simple and probably not 
just client-side.  So the 'pristines-on-demand-on-mwf' branch 
takes a reasonable-but-not-perfect solution for now; the 
'pristines-on-demand-issue4892' that branches from it improves the 
situation [1], but is not complete and needn't block release. 
(See [2] for deeper discussion.)

I'll talk privately with them about finishing this and the budget 
required to do so.  I think we're close and would really like to 
see this feature released soon.  (Note that we have merged the 
'multi-wc-format' branch to trunk, in r1898187 on 2022-02-18. 
IIUC that was a necessary predecessor to everything else.)

We should be able to get there from here, right?

Best regards,
-Karl

[1] This command will give you some sense of the difference 
between those two branches:

  $ svn diff 
  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf/notes/i525/i525-user-guide.md 
  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-issue4892/notes/i525/i525-user-guide.md

[2] 
https://lists.apache.org/thread/mwo5zy14wlkbs8j4334zn0296dl472qd

    From: Evgeny Kotkov
    To: Julian Foad
    Cc: Subversion Dev
    Subject: Re: Issue #525/#4892: on only fetching the pristines 
    we really need
    Date: Fri, 11 Mar 2022 18:23:55 +0300
    Message-ID: 
    <CA...@mail.gmail.com>

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Nathan Hartman <ha...@gmail.com>.

On Sat, Nov 5, 2022 at 6:13 PM Karl Fogel <kf...@red-bean.com> wrote:
>
> Hi, all.  This is a high-level mail in which I try to figure out
> the current status of the issue #525 work and what's left to land
> it in trunk and release it.  Corrections and feedback welcome.

Thanks for the overview and the work already done to make this
possible!

The P-O-D feature itself works.

What's left to do for a first release, IMHO:

(1) Decide on user-facing names for the feature and its command line
switch(es).

(2) Resolve the [TODO] that Karl mentions (decoupling the compatible
version switch from the i525pod switch).

Though there are many other possible enhancements, some of them touched
upon in Karl's message, I think these two items are the only really
crucial ones for a first release.

I have much to say on both of these but I won't go into detail yet
because that would hijack the thread away from the high-level topic of:
what remains to be done for initial viable product? I'd like to give
others a chance to respond before we dive down the rabbit hole. :-)

It's better if each of the above becomes a thread devoted to that
topic.

I'll point out that some initial release note text was drafted at [1].

Cheers,
Nathan

[1] https://subversion-staging.apache.org/docs/release-notes/1.15.html#bare-working-copies

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Nathan Hartman wrote on Wed, Dec 07, 2022 at 20:29:11 -0500:
> On Wed, Dec 7, 2022 at 12:11 PM Evgeny Kotkov via dev <
> dev@subversion.apache.org> wrote:
> 
> >
> > I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> > merge to trunk.  I could do that, assuming there are no objections.
> 
> 
> 
> I'd like to echo what others have already said by saying a great big THANK
> YOU, to all who have worked on this cool new feature so far!
> 
> I used an earlier incarnation of this branch some months ago in real usage
> scenarios with good results and looking at the recent commit emails as
> they've happened everything looks sensible to me.
> 
> I will try to run the full test suite in the next couple of days and
> assuming the tests pass for me I'll use it as my daily driver to test the
> real usage. Obviously I'll post here if I find anything...
> 
> Meanwhile I'd like to say that on further thought and after reading Johan's
> and Karl's feedback regarding the feature switch naming, I've come around
> to the point of view that --store-pristine={yes|no} is a perfectly fine UI.
> 

Well, if we're bikeshedding anyway, how about --backend-tweaks=without-pristines?
We can support just two values for starters ("without pristines" and
"with pristines"), and have the room to extend this in 1.16, similar to
--trust-server-cert/--trust-server-cert-failures and
--pre-1.4-compatible/--compatible-version.

Similarly, a new config file section with one valid option might make
sense if we anticipate adding more options to that section in the
future.  This way we avoid having the configuration split across two
places.

> Given that this is now the command line switch name, and since users are
> given direct control over the pristinefulness of a WC, and we've been
> calling this feature Pristines On Demand since its inception, I think we
> should finally bless this as the official name of the feature.
> 
> In the next couple of days I plan to update the staged 1.15 release notes,
> which until now tentatively called it Bare Working Copies, to call it
> Pristines On Demand and to complete the description there.
> 
> Regarding the SHA hash question:
> 
> While here, I would like to raise a topic of incorporating a switch from
> > SHA1 to a different checksum type (without known collisions) for the new
> > working copy format.  This topic is relevant to the pristines-on-demand
> > branch, because the new "is the file modified?" check relies on the
> > checksum
> > comparison, instead of comparing the contents of working and pristine
> > files.
> >
> > And so while I consider it to be out of the scope of the
> > pristines-on-demand
> > branch, I think that we might want to evaluate if this is something that
> > should be a part of the next release.
> 
> 
> Is it feasible and would it be beneficial to somehow decouple the hash code
> type from the wc format version? Asking because IIRC the need for a format
> bump to change hashes was one of the reasons it wasn't done a few years ago.

Maybe if we teach f32 to read /two/ new checksum kinds?  E.g., if we
teach f32 to read both SHA-512 and SHA-3, then even if 1.15 f32 writes
SHA-512 by default, it will nevertheless be able to read f32 wc's with
SHA-3 rows that 1.16 might create.

svn_checksum_kind_t's possible values include svn_checksum_fnv1a_32, so
I guess we already support reading wc.db's that use FNV-1a checksums?
(Incidentally, f31 is new in 1.8 whereas svn_checksum_fnv1a_32 is new
in 1.9.)

Cheers,

Daniel

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Nathan Hartman <ha...@gmail.com>.

On Wed, Dec 7, 2022 at 12:11 PM Evgeny Kotkov via dev <
dev@subversion.apache.org> wrote:

>
> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.

I'd like to echo what others have already said by saying a great big THANK
YOU, to all who have worked on this cool new feature so far!

I used an earlier incarnation of this branch some months ago in real usage
scenarios with good results and looking at the recent commit emails as
they've happened everything looks sensible to me.

I will try to run the full test suite in the next couple of days and
assuming the tests pass for me I'll use it as my daily driver to test the
real usage. Obviously I'll post here if I find anything...

Meanwhile I'd like to say that on further thought and after reading Johan's
and Karl's feedback regarding the feature switch naming, I've come around
to the point of view that --store-pristine={yes|no} is a perfectly fine UI.

Given that this is now the command line switch name, and since users are
given direct control over the pristinefulness of a WC, and we've been
calling this feature Pristines On Demand since its inception, I think we
should finally bless this as the official name of the feature.

In the next couple of days I plan to update the staged 1.15 release notes,
which until now tentatively called it Bare Working Copies, to call it
Pristines On Demand and to complete the description there.

Regarding the SHA hash question:

While here, I would like to raise a topic of incorporating a switch from
> SHA1 to a different checksum type (without known collisions) for the new
> working copy format.  This topic is relevant to the pristines-on-demand
> branch, because the new "is the file modified?" check relies on the
> checksum
> comparison, instead of comparing the contents of working and pristine
> files.
>
> And so while I consider it to be out of the scope of the
> pristines-on-demand
> branch, I think that we might want to evaluate if this is something that
> should be a part of the next release.

Is it feasible and would it be beneficial to somehow decouple the hash code
type from the wc format version? Asking because IIRC the need for a format
bump to change hashes was one of the reasons it wasn't done a few years ago.

Cheers,
Nathan

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 13 Dec 2022, Evgeny Kotkov wrote:
>Evgeny Kotkov <ev...@visualsvn.com> writes:
>Merged in https://svn.apache.org/r1905955

W00t!!  Thank you, and Julian and Daniel and everyone who's 
contributed to this.

So... do we have a release manager?  :-)

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> Merged in https://svn.apache.org/r1905955
>
> I'm going to respond on the topic of SHA1 a bit later.

For the history: thread [1] proposes the `pristine-checksum-salt` branch that
adds the infrastructure to support new pristine checksum kinds in the working
copy and makes a switch to the dynamically-salted SHA1.

From the technical standpoint, I think that it would be better to release
the first version of the pristines-on-demand feature having this branch
merged, because now we rely on the checksum comparison to determine if a
file has changed — and currently it's a checksum kind with known collisions.

At the same time, having that branch merged probably isn't a formal release
blocker for the pristines-on-demand feature.  Also, considering that the
`pristine-checksum-salt` branch is currently vetoed by danielsh (presumably,
for an indefinite period of time), I'd like to note that personally I have
no objections to proceeding with a release of the pristines-on-demand
feature without this branch.

[1] https://lists.apache.org/thread/xmd7x6bx2mrrbw7k5jr1tdmhhrlr9ljc

Regards,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.

Merged in https://svn.apache.org/r1905955

I'm going to respond on the topic of SHA1 a bit later.

Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Daniel Sahlberg <da...@gmail.com>.

Evgeny,

Thanks so much for your hard work in pushing this project forward!

I don't think I can contribute much in getting this merged to trunk (from
lack of C experience and lack of time to dig into the inner workings), but
I hope it can be completed!

Kind regards,
Daniel Sahlberg


Den ons 7 dec. 2022 kl 18:10 skrev Evgeny Kotkov via dev <
dev@subversion.apache.org>:

> Evgeny Kotkov <ev...@visualsvn.com> writes:
>
> > > IMHO, once the tests are ready, we could merge it and release
> > > it to the world.
> >
> > Apart from the required test changes, there are some technical
> > TODOs that remain from the initial patch and should be resolved.
> > I'll try to handle them as well.
>
> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.
>
>
> https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf
>
> The branch includes the following:
> – Core implementation of the new mode where required pristines are fetched
>   at the beginning of the operation.
> – A new --store-pristine=yes/no option for `svn checkout` that is persisted
>   as a working copy setting.
> – An update for `svn info` to display the value of this new setting.
> – A standalone test harness that tests main operations in both
>   --store-pristine modes and gets executed on every test run.
> – A new --store-pristine=yes/no option for the test suite that forces all
>   tests to run with a specific pristine mode.
>
> The branch passes all tests in my Windows and Linux environments, in both
> --store-pristine=yes and =no modes.
>
>
> While here, I would like to raise a topic of incorporating a switch from
> SHA1 to a different checksum type (without known collisions) for the new
> working copy format.  This topic is relevant to the pristines-on-demand
> branch, because the new "is the file modified?" check relies on the
> checksum
> comparison, instead of comparing the contents of working and pristine
> files.
>
> And so while I consider it to be out of the scope of the
> pristines-on-demand
> branch, I think that we might want to evaluate if this is something that
> should be a part of the next release.
>
>
> Thanks,
> Evgeny Kotkov
>

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 07 Dec 2022, Evgeny Kotkov wrote:
>The branch passes all tests in my Windows and Linux environments, 
>in both
>--store-pristine=yes and =no modes.

FYI, it passes all tests here too (on Debian GNU/Linux, up-to-date 
'testing' distro).  Attached file has details; there were some 
XFAILs, but no FAILs.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

Posted by Branko Čibej <br...@apache.org>.

On 20.12.2022 09:14, Evgeny Kotkov wrote:
> 2) We already need a working copy format bump for the pristines-on-demand
>     feature.  So using that format bump to solve the SHA1 issue might reduce
>     the overall number of required bumps for users (assuming that we'll still
>     need to switch from SHA1 at some point later).

Using a new hashing algorithm in the working copy is relatively simple. 
Making such a change backwards-compatible is not. It would be really 
nice if this could be done in a way that allows newer clients to still 
support older working copies without upgrading them; after all, we have 
the infrastructure for this in place now.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Tue, Dec 20, 2022 at 11:14:00 +0300:
> [Moving discussion to a new thread]
> 
> We currently have a problem that a working copy relies on the checksum type
> with known collisions (SHA1).  A solution to that problem

Why is libsvn_wc's use of SHA-1 a problem?  What's the scenario wherein
Subversion will behave differently than it should?

> is to switch to a different checksum type without known collisions in
> one of the newer working copy formats.

Such as SHA-1 salted by NODES.LOCAL_RELPATH and NODES.WC_ID (or a per-wc UUID)?

> Since we plan on shipping a new working copy format in 1.15, this seems to
> be an appropriate moment of time to decide whether we'd also want to switch
> to a checksum type without known collisions in that new format.
> 

What's the acceptance test we use for candidate checksum algorithms?

You say we should switch to a checksum algorithm that doesn't have known
collisions, but, why should we require that?  Consider the following
160-bit checksum algorithm:
.
    1. If the input consists of 40 ASCII lowercase hex digits and
       nothing else, return the input.
    2. Else, return the SHA-1 of the input.

This algorithm has a trivial first preimage attack.  If a wc used this
identity-then-sha1 algorithm instead of SHA-1, then… what?

> Below are the arguments for including a switch to a different checksum type
> in the working copy format for 1.15:
> 
> 1) Since the "is the file modified?" check now compares checksums, leaving
>    everything as-is may be considered a regression, because it would
>    introduce additional cases where a working copy currently relies on
>    comparing checksums with known collisions.
> 

Well, SHA-1 is still collision-free so long as one is not deliberately
trying to use collisions, so this would only be a regression if we
consider "Deliberately store files that have the same checksum" to be
a use-case.  Do we?

I recall we discussed this when shattered.io was announced, and we
didn't rush to upgrade the checksums we use everywhere, so I guess back
then we came to the conclusion that wasn't a use-case.  (Of course we
can change our opinion; that's just a datapoint, and there may be more,
on both sides, in the old thread.)

I looked for the old thread and didn't find it.  (I looked in the
private@ archives too in case the thread was there.)

> 2) We already need a working copy format bump for the pristines-on-demand
>    feature.  So using that format bump to solve the SHA1 issue might reduce
>    the overall number of required bumps for users (assuming that we'll still
>    need to switch from SHA1 at some point later).
> 

Considering that 1.15 will support reading and writing both f31 and f32,
the "overall number of required bumps" between 1.8 and trunk@HEAD is
zero, meaning the proposed change can't reduce that number.

> 3) While the pristines-on-demand feature is not released, upgrading
>    with a switch to the new checksum type seems to be possible without
>    requiring a network fetch.

I infer the scenario in question here is upgrading a (say) pristinesless
wc to a a newer format that supports a new checksum algorithm.

>    But if some of the pristines are optional, we lose the possibility
>    to rehash all contents in place.  So we might find ourselves having
>    to choose between two worse alternatives of either requiring
>    a network fetch during upgrade or entirely prohibiting an upgrade
>    of working copies with optional pristines.

Why would we want to rehash everything in place?  The 1.15→1.16 upgrade
could simply leave pristineless files' checksums as SHA-1 until the next
«svn up», just like «svnadmin upgrade» of FSFS doesn't retroactively add
SHA-1 checksums to node-rev headers or "-file" or "-dir" indicators in
the changed-paths section.

There may be yet other alternatives.

> Thoughts?

I'm not voting either -0 or +0 at this time.

Cheers,

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Fri, Jan 20, 2023 at 11:18:56 -0600:
> On 20 Jan 2023, Nathan Hartman wrote:
> > Taking a step back, this discussion started because pristine-free WCs
> > are IIUC more dependent on comparing hashes than pristineful WCs, and
> > therefore a hash collision could have more impact in a pristine-free
> > WC. "Guarantees" were mentioned, but I think it's important to state
> > that there's only a guarantee of probability, since as mentioned above
> > all hashes will have collisions.
> 
> Sure, in a literal mathematical sense, but not in a sense that matters for
> our purposes here.
> 
> In the absence of an intentionally caused collision, a good hash function
> has *far* less chance of accidental collision than, say, the chance that
> your CPU will malfunction due to a stray cosmic ray, or the chance of us
> getting hit by a planet-destroying meteorite tomorrow.
> 
> For our purposes, "guarantee" is accurate.  No guarantee we make can be
> stonger than the inverse probability of a CPU/memory malfunction anyway.
> 

The probability of an accidental collision in a "good" N-bit hash
function is on the order of 1/√2ⁿ, which for sufficiently large N is
considered an acceptable risk.  That's invariant over time, however,
intentionally causing collisions becomes easier over time.

> > We already can't store files with identical SHA1 hashes, but AFAIK the
> > only meaningful impact we've ever heard is that security researchers
> > cannot track files they generate with deliberate collisions. The same
> > would be true with any hash type, for collisions within that hash
> > type.
> 
> Yes.  A hash is considered "broken" the moment security researches can
> generate a collision.
> 

To be clear, is this what you're saying? —
.
    Premise: There is a collision attack against SHA-1.
    Conclusion: Subversion should stop using SHA-1.

This conclusion does not follow from this premise.  For instance, FSFS
checks for collisions, so it can actually use "File length in bytes" as
a checksum and everything would work; the only thing that would change
is that it would not be possible to commit a file that's the same
expanded_size as any other node-rev (including directories).

And, anyway, the burden is not on me to disprove your claim, but on
you to prove it.

> FWIW, in one of my previous posts, I described a real-life scenario in which
> the ability to generate a chosen-plaintext collision in an SVN working copy
> would have security implications.

Yes, and as I have already asked: What other counters to that attack,
besides migrating away from SHA-1, have you considered?  Have you
considered the downsides of migrating away from SHA-1?

Also, /if/ we changed checksums, would that address the attack?  Put
differently, why is a similar attack impossible if we change the
checksum algorithm?  Why is use of SHA-1 a /sine qua non/ of your
scenario?

For example, if we used another checksum algorithm, the attacker from
your scenario might opt to edit the base checksums in .svn/wc.db and
rename the .svn/pristine/ files accordingly.  That's much easier to pull
off, and will be easy to adapt if we change the algorithm again, but on
the other hand, requires write access to the .svn directory and is
easier to discover.

Daniel

> Best regards,
> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, Jan 30, 2023 at 17:26:03 -0600:
> On 29 Jan 2023, Evgeny Kotkov via dev wrote:
> > I have *absolutely* no idea where "being railroaded through" comes
> > from.  Really, it's a wrong way of portraying and thinking about the
> > events that have happened so far.
> > 
> > Reiterating over those events: I wrote an email containing my
> > thoughts and explaining the motivation for such change.  I didn't
> > reply to some of the questions (including some tricky questions,
> > such as the one featuring a theoretical hash function), because they
> > have been at least partly answered by others in the thread, and I
> > didn't have anything valuable to add at that time.
> > 
> > During that time, I was actively coding the core part of the change,
> > to check if it's possible technically.  Which is important, as far
> > as I believe, because not all theoretically possible solutions can
> > be implemented without facing significant practical or
> > implementation-related issues, and it seems to me that you
> > significantly undervalue such an approach.
> > 
> > I do not say my actions were exemplary, but as far as I can tell,
> > they're pretty much in line with how svn-dev has been operating so
> > far. But, it all resulted in an unclear veto without any _technical_
> > arguments, where what's being vetoed is unclear as well, because the
> > change was not ready at the moment veto got casted.
> > 
> > And because your veto goes in favor of a specific process
> > (considering that no other arguments were given), the only thing
> > that's *actually* being railroaded is an odd form of an RTC
> > (review-then-commit) process that is against our usual CTR
> > (commit-then-review) [1,2].  That's railroading, because it hasn't
> > been explicitly discussed anywhere and a consensus on it has not
> > been reached.
> 
> Daniel, given what's in Evgeny's branch now, could you summarize your
> current technical objections if any?
> 
> If they are something like "This code is solving the wrong problem(s)" or
> "I'm not sure what problem(s) it's supposed to solve", those count as
> technical objections.  It's just that it would be useful to have the
> objection(s) gathered in one place. This thread has been long and somewhat
> digressive -- I'm not saying that's due to you -- and I at least have found
> it a bit difficult to keep track of the concrete objections versus various
> interesting but ultimately theoretical points.
> 

Quoting my other reply just now:

    […] it's pretty simple.  [The OP] said "We should do Y because it
    addresses X".  [The OP] didn't explain why X needs to be addressed, didn't
    consider what alternatives there are to Y, didn't consider any cons that
    Y may have… and when people had questions, [the OP] just began to
    implement Y, without responding to or even acknowledging those
    questions.
    
    That's not how design discussions work.  A design discussion doesn't go
    "state decision; state pros; implement"; it goes "state problem; discuss
    potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).
    
    That's why I called veto: not because I considered any particular
    proposal then on the table unreasonable, but because I considered /the
    decision process being used/ unreasonable (cf. [7]).

Concretely: Why would migrating away from SHA-1 be a good thing in the
first place?  Assuming that it /would/ be a good thing, what alternative
ways are there to achieve whatever the goodness may be (new feature /
bugfix / resilience to some attack vector / etc.)?  What are the
potential *downsides* of migrating away from SHA-1?

The same, restated at a higher level of abstraction: "Migrate
away from SHA-1" is a means, not an end.  Define the ends and have
a non-predetermined-outcome discussion on how to achieve them.

"Reduce the security impact to our users of second-preimage attacks
against SHA-1" would be an end.  I don't know whether it's the only one
or whether there are additional ones.

[As to the branch, I'm not sure whether to restate my position on it or
not — so I'll restate it, erring on the side of including too much
rather than too little, but feel free to ignore the following paragraph
at will:]

Was the branch commenced as a PoC / smoke test, to explore one proposed
direction and to be discarded if the consensus compass should end up
pointing towards another cardinal direction?  Or was it commenced on the
assumption that consensus on migrating to SHA-1 to SHA-256 went without
saying, had already formed, or would necessarily have formed by 1.15.0-rc1?

> The reason I'm supportive of Evgeny's direction is that his changes, if
> completed, would offer a solution to the (admittedly still somewhat distant)
> security concern I raised early on. Essentially, I'm worried that
> second-preimage attacks on SHA-1 are coming eventually (maybe I'm wrong
> about this -- they are after all significantly harder than mere collision
> attacks).  *If* such attacks become possible, then our WC could report a
> file as unmodified when in fact it is modified, which would have real
> security implications, as I outlined.
> 

I take it you're referring to this:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C87k02dr4mn.fsf%40red-bean.com%3E
    I have put WordPress installations under Subversion version control before.
    Once, I detected an attack on one of those WordPress servers when one of the
    things the attacker did was modify some of the WordPress scripts on the
    server.  Those files showed up as modified when I ran 'svn st', and from
    there I ran 'svn diff' and figured out what had happened.  But a
    super-careful attacker could make modifications that leave the
    version-controlled files with the same SHA1 hash they had before, thus
    making it harder to detect the attack.
    
    Yes, I realize there are other ways to detect modifications, and that random
    attackers are unlikely to take the trouble to preserve hashes.  On the other
    hand, a well-resourced spear-fishing attacker who knows something about the
    usage of SVN at their target might indeed try a hash-preserving approach to
    breaking in. The point is, if we're counting on the hashes having certain
    semantics, then our users are counting on it too.  If SHA1 no longer has
    those semantics, we should upgrade.

I offered one alternative counter to that here:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3Cadacbb6f-e0cb-4e5b-8603-0eda19f93b3c%40app.fastmail.com%3E
    So, suppose the wc didn't hardcode _any particular_ hash function for
    naming pristines and for status walks — not md5, not sha1, not sha256 —
    but had each «svn checkout» run pick a hash function uniformly at random
    out of a large enough family of hash functions[1].  (Intuitively, think
    of a family of hash functions as a hash function with a random salt,
    similar to [2].)
    
    This way, even if someone tried to deliberately create a collision, they
    wouldn't be able to pick a collision "off the shelf", as with
    shattered.io; they'd need to compute a collision for the specific hash
    function ("salt") used by that particular wc.  That's more difficult than
    creating a collision in a well-known hash function, regardless of
    whether we treat the salt's value as a secret of the wc (as in, stored
    in a mode-0400 file in under .svn directory and not disclosed to the
    server) or as a value the attacker is assumed to know.
    
    So, that's one way to address [the WordPress scenario].

And analysed the marginal attack difficulty if we change the checksum
algorithm here:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121102455.GB3174%40tarpaulin.shahaf.local2%3E
    For example, if we used another checksum algorithm, the attacker from
    your scenario might opt to edit the base checksums in .svn/wc.db and
    rename the .svn/pristine/ files accordingly.  That's much easier to pull
    off, and will be easy to adapt if we change the algorithm again, but on
    the other hand, requires write access to the .svn directory and is
    easier to discover.

In any case, even assuming second-preimage attacks against SHA-1 are
something we should assume adversaries capable of [and I'm not
expressing any opinion on this question], it does not /automatically/
follow that we should migrate away from SHA-1:

    https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121102455.GB3174%40tarpaulin.shahaf.local2%3E
    To be clear, is this what you're saying? —
    .
        Premise: There is a collision attack against SHA-1.
        Conclusion: Subversion should stop using SHA-1.

    This conclusion does not follow from this premise.  For instance, FSFS
    checks for collisions, so it can actually use "File length in bytes" as
    a checksum […]

And to be clear: I'm not saying Subversion should continue using SHA-1,
and I'm not saying that Subversion should stop using SHA-1.  I'm saying
we should consider what the alternatives to that are.

> Like I said, this is far from urgent, and IMHO it certainly should not delay
> a release of our new pristineless feature.  But when and if Evgeny's branch
> is ready (where "ready" presumably includes something other than salted
> SHA-1 as the other checksum option), I would like to see these changes go
> in, unless we identify some harm from them.
> 
> For everyone's ease of reference:
> 
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/BRANCH-README
> 
> $ svn log --stop-on-copy
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/
> 
> Best regards,
> -Karl

Thanks for allowing me the time to write a proper response :)

Daniel

Glossary of attacks (was: Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format)

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Definitions of attacks:

1. Collision attack:
   Given h(),
   find x₁, x₂ such that h(x₁) == h(x₂).

2. Second preimage attack:
   Given h() and x,
   find x′ such that h(x) == h(x′).

3. First preimage attack:
   Given h() and y,
   find x such that h(x) == y.

4. Chosen prefix attack:
   Given h(), p₁, and p₂,
   find m₁, m₂ such that h(m₁) == h(m₂) and m₁.startswith(p₁) and m₂.startswith(p₂).

Daniel Shahaf wrote on Thu, Jan 26, 2023 at 09:33:59 +0000:
> Evgeny Kotkov via dev wrote on Mon, Jan 23, 2023 at 02:28:50 +0300:
> > However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
> > probably only a matter of time until the situation becomes worse.
> > 
> 
> Quoting the third hunk of 
> <https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C20221220201300.GH32332%40tarpaulin.shahaf.local2%3E>:
> 
>     What's the acceptance test we use for candidate checksum algorithms?
>     
>     You say we should switch to a checksum algorithm that doesn't have known
>     collisions, but, why should we require that?  Consider the following
>     160-bit checksum algorithm:
>     .
>         1. If the input consists of 40 ASCII lowercase hex digits and
>            nothing else, return the input.
>         2. Else, return the SHA-1 of the input.
>     
>     This algorithm has a trivial first preimage attack.  If a wc used this
>     identity-then-sha1 algorithm instead of SHA-1, then… what?
> 
> > That could happen after a public disclosure of a pair of executable
> > files/scripts where the forged version allows for remote code execution.
> > Or maybe something similar with a file format that is often stored in
> > repositories and that can be executed or used by a build script, etc.
> > 
> 
> Err, hang on.  Your reference described a chosen-prefix attack, while
> this scenario concerns a single public collision.  These are two
> different things.
> 
> Disclosure of of a pair of executable files/scripts isn't by itself
> a problem unless one of the pair ("file A") is in a repository
> somewhere.  Now, was the colliding file ("file B") generated _before_ or
> _after_ file A was committed?
> 
> - If _before_, then it would seem Mallory had somehow managed to:
> 
>   1. get a file of his choosing committed to Alice's repository; and
> 
>   2. get a wc of Alice's repository into one of the codepaths that
>      assume SHA-1 is one-to-one / collission-free (currently that's the
>      ra_serf optimization and the 1.15 wc status).
> 
>   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
>   me how file B would reach the wc in step #2… but insofar as security
>   assumptions go, it seems reasonable to assume Mallory can make this
>   happen.
> 
>   So, I agree it's a scenario we should address.  What options do we
>   have to address it?  (I grant that migrating away from SHA-1 is one
>   option.)
> 
> - If _after_, then you're presuming not simply a collision attack but
>   a second preimage attack.  Should we assume Mallory to be able to
>   mount a second preimage attack?
> 
> Chosen-prefix collision attacks can help Mallory in a variant of the
> "before" case: Mallory computes a collision, sends file A to Alice (who
> commits it), and invokes his assumed ability to inject file B into
> Alice's wc.  This would work for file formats that ignore the unchosen
> suffix.

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Sun, Jan 29, 2023 at 16:36:12 +0300:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > > That could happen after a public disclosure of a pair of executable
> > > files/scripts where the forged version allows for remote code execution.
> > > Or maybe something similar with a file format that is often stored in
> > > repositories and that can be executed or used by a build script, etc.
> > >
> >
> > Err, hang on.  Your reference described a chosen-prefix attack, while
> > this scenario concerns a single public collision.  These are two
> > different things.
> 
> A chosen-prefix attack allows finding more meaningful collisions such as
> working executables/scripts.  When such collisions are made public, they
> would have a greater exploitation potential than just a random collision.
> 

Right.  So we're assuming Mallory generates a chosen-prefix collision,
and then somehow pulls off steps #1 and #2-as-amended [both quoted
below], with Alice noticing none of that.

That still sounds like something we should assume Mallory can pull off.

> > Disclosure of of a pair of executable files/scripts isn't by itself
> > a problem unless one of the pair ("file A") is in a repository
> > somewhere.  Now, was the colliding file ("file B") generated _before_ or
> > _after_ file A was committed?
> >
> > - If _before_, then it would seem Mallory had somehow managed to:
> >
> >   1. get a file of his choosing committed to Alice's repository; and
> >
> >   2. get a wc of Alice's repository into one of the codepaths that
> >      assume SHA-1 is one-to-one / collission-free (currently that's the
> >      ra_serf optimization and the 1.15 wc status).
> 
> Not only.  There are cases when the working copy itself installs the working
> file with a hash lookup in the pristine store.  This is more true for 1.14
> than trunk, because in trunk we have the streamy checkout/update that avoid
> such lookups by writing straight to the working file.  However, some of
> the code paths still install the contents from the pristine store by hash.
> Examples include reverting a file, copying an unmodified file, switching
> a file with keywords, the mentioned ra_serf optimization, and etc.
> 

Thanks.  In terms of that step #2, all these are also candidates for
"one of the codepaths", then.

> >   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
> >   me how file B would reach the wc in step #2…
> 
> If Mallory has write access, she could commit both files, thus arranging for
> a possible content change if both files are checked out to a single working
> copy.  This isn't the same as just directly modifying the target file, because
> file content isn't expected to change due to changes in other files (that can
> be of any type), so this attack has much better chances of being unnoticed.
> 

Well, yes, but the write access requirement lowers severity.

> If Mallory doesn't have write access, there should be other vectors, such
> as distributing a pair of files (harmless in the context of their respective
> file formats) separately via two upstream channels.  Then, if both of the
> upstream distributions are committed into a repository and their files are
> checked out together, the content will change, allowing for a malicious
> action.

I take it we're still under the assumption that someone's repository has
rep-sharing disabled (or unsupported, i.e., pre-1.6 format) despite the
recommendation in security/sha1-advisory.txt, since otherwise the commit
would be rejected.

So, back to my question which you have snipped:

> >   So, I agree it's a scenario we should address.  What options do we
> >   have to address it?  (I grant that migrating away from SHA-1 is one
> >   option.)

Care to address that?

Daniel

> 
> Regards,
> Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Branko Čibej <br...@apache.org>.

On 18.01.2024 08:43, Daniel Sahlberg wrote:
> As far as I understand, the point of multi-hash is to keep the WC 
> format between versions (so older clients can continue to use the WC). 
> I need some help to understand how that would work in practice. Let's 
> say that 1.15 adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
> - A 1.17 client will only use SHAABC or SHAXYZ hashes.
> - A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
> - A 1.15 client can only use SHA1 and SHAABC hashes.
>
> How can these work together? A WC created in 1.17 can't be used by a 
> 1.15 client and a WC created in 1.15 (with SHA1) can't be used by a 
> 1.17 client. How is this different from bumping the format? How do we 
> detect this?

It's just another dimension of changing the format. When you introduce 
multihash, you have to bump the format number so that clients that don't 
know about it won't try to use the WC. Clients that _do_ know about it 
will have to check which hash algorithm(s) are used in any case.

> At least, we'd need some method of updating the hashes in the 
> database, akin the WC format upgrades in some versions (was it 1.8?).

"svn upgrade" is where this would happen. On the multi-wc-format branch 
(if memory serves), it accepts a target WC version -- which is 
equivalent to the feature set supported by the WC. There's no reason why 
it couldn't also grow a "--force-hash=quantum-entangled" option.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Sahlberg <da...@gmail.com> writes:

> As far as I understand, the point of multi-hash is to keep the WC format
> between versions (so older clients can continue to use the WC).

Just as a minor note, the working copies created using the implementation
on the `pristine-checksum-salt` branch don't multi-hash the contents, but
rather make the [single] used checksum kind configurable and persist it at
the moment when a working copy is created or upgraded.

> I need some help to understand how that would work in practice. Let's say
> that 1.15 adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
> - A 1.17 client will only use SHAABC or SHAXYZ hashes.
> - A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
> - A 1.15 client can only use SHA1 and SHAABC hashes.
>
> How can these work together? A WC created in 1.17 can't be used by a 1.15
> client and a WC created in 1.15 (with SHA1) can't be used by a 1.17 client.
> How is this different from bumping the format? How do we detect this?

In the current design available on the `pristine-checksum-salt` branch, the
supported checksum kinds are tied to a working copy format, and any supported
checksum kind may additionally use a dynamic salt.  For example, format 33
supports only SHA-1 (regular or dynamically salted), but a newer format 34
can add support for another checksum kind such as SHA-2 if necessary.

When an existing working copy is upgraded to a newer format, its current
checksum kind is retained as is (we can't rehash the content in a
`--store-pristine=no` case because the pristines are not available).

I don't know if we'll find ourselves having to forcefully phase out SHA-1
*even* for such working copies that retain an older checksum kind, i.e.,
it might be enough to use the new checksum kind only for freshly created
working copies.  However, there would be a few options to consider:

I think that milder options could include warning the user to check out a
new working copy (that would use a different checksum kind), and a harsher
option could mean adding a new format that doesn't support SHA-1 under
any circumstances, and declaring all previously available working copy
formats unsupported.

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

@Karl Fogel <kf...@red-bean.com>,  @Evgeny Kotkov
<ev...@visualsvn.com>

Any chance for a comment on the questions in this thread?

I've also added my own comment below.

Kind regards,
Daniel



Den sön 14 jan. 2024 kl 00:56 skrev Nathan Hartman <hartman.nathan@gmail.com
>:

> On Fri, Jan 12, 2024 at 3:51 PM Johan Corveleyn <jc...@gmail.com> wrote:
>
>> On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name>
>> wrote:
>> ...
>> > Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
>> > I had the context in our heads, and the cache misses took their toll in
>> > tuits and in wallclock time.  Furthermore, I have less spare time for
>> > dev@ discussions than I did when I cast the veto (= a year ago next
>> > Saturday).  Going forward it might be preferable for threads not to
>> > hibernate.
>>
>> I agree, but obviously the hibernation is not some deliberate action
>> by anyone. It's just that most of us here have less spare time for
>> dev@ discussions (and for SVN development) than before. Especially for
>> such complex matters, and especially when people feel there are
>> walking into a minefield. There are only a few active devs left, and
>> tuits are running low ...
>>
>> ...
>> > That being the case, I have considered whether merging the feature
>> > branch outweighs letting dev@ take a not-only-/pro forma/ role in
>> > design discussions.  I am of the opinion that it does not, and
>> > therefore I reäfirrm the veto.
>>
>> It has become more clear to me (I was only following tangentially)
>> that your veto is focused on the development methodology and the lack
>> of design discussion. Is that a valid reason for a veto? We are low on
>> resources, someone still finds time to make some progress, no one
>> blocks it on technical grounds, and then someone vetoes it because we
>> don't have enough resources?
>>
>> That puts us pretty much in deadlock, because we are too low on
>> resources. Or maybe I misunderstand?
>>
>> To be clear: I appreciate your input, Daniel, and your insistence on a
>> more thorough design discussion. I assume it's coming from a genuine
>> concern that we formulate problems well, and think hard about possible
>> solutions (focusing on the precise problem we are trying to solve).
>> But at the end of the day, if that design discussion doesn't happen
>> (or not enough to your satisfaction anyway), is that grounds for a
>> veto? For me it's a tough call, because on the one hand you have a
>> point, but on the other hand ... you're blocking _some_ progress
>> because the process behind it is not perfect (which is hard to do with
>> the 3.25 tuits we have left).
>>
>> > P.S.  Could that BRANCH-README please state what's the problem the
>> branch
>> > means to solve, i.e., the goal / acceptance test?  "Make it possible to
>> > «svn add» SHA-1 collisions"?
>>
>> I agree that would be a good step.
>>
>> I too find it a bit unclear what problem we're actually trying to
>> solve, apart from a vague feeling that SHA-1 will become more and more
>> broken over time, and that this will cause fatal injury to SVN (in its
>> WC, protocol, dump format, or repository). And perhaps the fact that
>> security auditors are becoming more and more triggered by seeing SHA-1
>> (even if they don't understand the way it is used and its
>> ramifications). Making it possible to 'svn add' SHA-1 collisions is
>> not it, I think.
>>
>> --
>> Johan
>>
>
>
> Johan's reply sums up my thoughts pretty closely.
>
> I would very much like to *avoid* all of the following: deadlock, bad
> feelings, and members of this small community leaving because of deadlocks
> or bad feelings.
>
> I agree that (at the very least), BRANCH-README should define what problem
> the branch aims to solve, and perhaps that's really the main thing we need
> to discuss and resolve.
>
> Johan touched on one issue with SHA1: regardless how it is actually used
> in SVN and whether it is adequate for those purposes, there is customer
> perception. I can imagine, for example, the IT dept of some big
> $corporation could blacklist SHA1 because it is considered broken for
> cryptographic purposes. But they could blacklist it for everything. Even
> though it is safe and effective for our use cases, try explaining that to
> an admin who is struggling to meet such a blanket policy.
>
> I would like to add another reason to think about a post-SHA1 future: I'm
> writing on mobile so I can't easily grep for things now, but could our
> dependencies eventually remove the SHA1 implementation? (I just saw
> something about removal of DSA from some famous lib not too long ago. SHA1
> could be next?)
>
> When would SHA1 disappear? I don't know, but I consider it plausible to
> happen in about 5 years.
>
> If SHA1 is removed in the future, there will need to be a mad dash to
> replace it. Or we'll have to add a new dependency to use an alternate
> implementation. Or we'll have to implement our own SHA1 or copy some code
> into SVN. All of these seem bad to me.
>
> Switching to a different hash is also a bad idea, I think, because it is
> likely to suffer the same problems as SHA1 later on, as cryptography
> research proceeds and newer hashes become declared broken.
>
> I'll try to describe what I think is a best case scenario: Support
> multi-hash in 1.15 in format 32 WCs. SHA1 can continue to be the default
> but we should be careful not to require a SHA1 implementation to exist.
> Furthermore, by default "svn checkout" continues to create format 31 WCs
> (this is implemented currently). When new (1.15 and up) servers talk to new
> clients, they'll have to negotiate the "best" common hash for the protocol.
> Over time, we can add other hashes. Over time, distros and package managers
> pick up 1.15. Someday down the line (5 years?), if SHA1 goes away, or an IT
> dept wants to avoid SHA1 for whatever reasons, most of the hard work of
> changing hashes will have been done already and most people will have the
> newer software on their system already. Changing hashes then becomes a
> trivial matter. The same will be true of any future hashes that become
> declared broken, requiring almost no additional work on our part. Notably,
> it will not be necessary to bump the WC or protocol formats because of
> hashes.
>

> Pros: Future-proofing against the real and perceived brokenness of any
> hash types.
>
> Cons: Requires a lot of work up front, which no one might volunteer to do.
>
> We should continue hashing out (pun intended) how to address the different
> concerns raised.
>
> Are there any technical reasons *not* to support other hashes going
> forward?
>
> Are there other pros or cons to supporting a scenario like I described?
>

As far as I understand, the point of multi-hash is to keep the WC format
between versions (so older clients can continue to use the WC). I need some
help to understand how that would work in practice. Let's say that 1.15
adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
- A 1.17 client will only use SHAABC or SHAXYZ hashes.
- A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
- A 1.15 client can only use SHA1 and SHAABC hashes.

How can these work together? A WC created in 1.17 can't be used by a 1.15
client and a WC created in 1.15 (with SHA1) can't be used by a 1.17 client.
How is this different from bumping the format? How do we detect this?

At least, we'd need some method of updating the hashes in the database,
akin the WC format upgrades in some versions (was it 1.8?).

Kind regards,
Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Sat, Jan 13, 2024 at 3:56 PM Nathan Hartman <ha...@gmail.com>
wrote:

> Pros: Future-proofing against the real and perceived brokenness of any
> hash types.
>

I meant to write:

Pros: Future-proofing against the real and perceived brokenness of any hash
types, or the deprecation and later removal of their implementations from
our deps.

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Fri, Jan 12, 2024 at 3:51 PM Johan Corveleyn <jc...@gmail.com> wrote:

> On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name>
> wrote:
> ...
> > Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
> > I had the context in our heads, and the cache misses took their toll in
> > tuits and in wallclock time.  Furthermore, I have less spare time for
> > dev@ discussions than I did when I cast the veto (= a year ago next
> > Saturday).  Going forward it might be preferable for threads not to
> > hibernate.
>
> I agree, but obviously the hibernation is not some deliberate action
> by anyone. It's just that most of us here have less spare time for
> dev@ discussions (and for SVN development) than before. Especially for
> such complex matters, and especially when people feel there are
> walking into a minefield. There are only a few active devs left, and
> tuits are running low ...
>
> ...
> > That being the case, I have considered whether merging the feature
> > branch outweighs letting dev@ take a not-only-/pro forma/ role in
> > design discussions.  I am of the opinion that it does not, and
> > therefore I reäfirrm the veto.
>
> It has become more clear to me (I was only following tangentially)
> that your veto is focused on the development methodology and the lack
> of design discussion. Is that a valid reason for a veto? We are low on
> resources, someone still finds time to make some progress, no one
> blocks it on technical grounds, and then someone vetoes it because we
> don't have enough resources?
>
> That puts us pretty much in deadlock, because we are too low on
> resources. Or maybe I misunderstand?
>
> To be clear: I appreciate your input, Daniel, and your insistence on a
> more thorough design discussion. I assume it's coming from a genuine
> concern that we formulate problems well, and think hard about possible
> solutions (focusing on the precise problem we are trying to solve).
> But at the end of the day, if that design discussion doesn't happen
> (or not enough to your satisfaction anyway), is that grounds for a
> veto? For me it's a tough call, because on the one hand you have a
> point, but on the other hand ... you're blocking _some_ progress
> because the process behind it is not perfect (which is hard to do with
> the 3.25 tuits we have left).
>
> > P.S.  Could that BRANCH-README please state what's the problem the branch
> > means to solve, i.e., the goal / acceptance test?  "Make it possible to
> > «svn add» SHA-1 collisions"?
>
> I agree that would be a good step.
>
> I too find it a bit unclear what problem we're actually trying to
> solve, apart from a vague feeling that SHA-1 will become more and more
> broken over time, and that this will cause fatal injury to SVN (in its
> WC, protocol, dump format, or repository). And perhaps the fact that
> security auditors are becoming more and more triggered by seeing SHA-1
> (even if they don't understand the way it is used and its
> ramifications). Making it possible to 'svn add' SHA-1 collisions is
> not it, I think.
>
> --
> Johan
>

Johan's reply sums up my thoughts pretty closely.

I would very much like to *avoid* all of the following: deadlock, bad
feelings, and members of this small community leaving because of deadlocks
or bad feelings.

I agree that (at the very least), BRANCH-README should define what problem
the branch aims to solve, and perhaps that's really the main thing we need
to discuss and resolve.

Johan touched on one issue with SHA1: regardless how it is actually used in
SVN and whether it is adequate for those purposes, there is customer
perception. I can imagine, for example, the IT dept of some big
$corporation could blacklist SHA1 because it is considered broken for
cryptographic purposes. But they could blacklist it for everything. Even
though it is safe and effective for our use cases, try explaining that to
an admin who is struggling to meet such a blanket policy.

I would like to add another reason to think about a post-SHA1 future: I'm
writing on mobile so I can't easily grep for things now, but could our
dependencies eventually remove the SHA1 implementation? (I just saw
something about removal of DSA from some famous lib not too long ago. SHA1
could be next?)

When would SHA1 disappear? I don't know, but I consider it plausible to
happen in about 5 years.

If SHA1 is removed in the future, there will need to be a mad dash to
replace it. Or we'll have to add a new dependency to use an alternate
implementation. Or we'll have to implement our own SHA1 or copy some code
into SVN. All of these seem bad to me.

Switching to a different hash is also a bad idea, I think, because it is
likely to suffer the same problems as SHA1 later on, as cryptography
research proceeds and newer hashes become declared broken.

I'll try to describe what I think is a best case scenario: Support
multi-hash in 1.15 in format 32 WCs. SHA1 can continue to be the default
but we should be careful not to require a SHA1 implementation to exist.
Furthermore, by default "svn checkout" continues to create format 31 WCs
(this is implemented currently). When new (1.15 and up) servers talk to new
clients, they'll have to negotiate the "best" common hash for the protocol.
Over time, we can add other hashes. Over time, distros and package managers
pick up 1.15. Someday down the line (5 years?), if SHA1 goes away, or an IT
dept wants to avoid SHA1 for whatever reasons, most of the hard work of
changing hashes will have been done already and most people will have the
newer software on their system already. Changing hashes then becomes a
trivial matter. The same will be true of any future hashes that become
declared broken, requiring almost no additional work on our part. Notably,
it will not be necessary to bump the WC or protocol formats because of
hashes.

Pros: Future-proofing against the real and perceived brokenness of any hash
types.

Cons: Requires a lot of work up front, which no one might volunteer to do.

We should continue hashing out (pun intended) how to address the different
concerns raised.

Are there any technical reasons *not* to support other hashes going forward?

Are there other pros or cons to supporting a scenario like I described?

Thanks,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

Den lör 13 jan. 2024 kl 00:50 skrev Johan Corveleyn <jc...@gmail.com>:

> On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name>
> wrote:
> ...
> > Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
> > I had the context in our heads, and the cache misses took their toll in
> > tuits and in wallclock time.  Furthermore, I have less spare time for
> > dev@ discussions than I did when I cast the veto (= a year ago next
> > Saturday).  Going forward it might be preferable for threads not to
> > hibernate.
>
> I agree, but obviously the hibernation is not some deliberate action
> by anyone. It's just that most of us here have less spare time for
> dev@ discussions (and for SVN development) than before. Especially for
> such complex matters, and especially when people feel there are
> walking into a minefield. There are only a few active devs left, and
> tuits are running low ...
>

I agree with Johan on this. The long hiatus is unfortunate. But it won't
help to point fingers at this point.



>
> ...
> > That being the case, I have considered whether merging the feature
> > branch outweighs letting dev@ take a not-only-/pro forma/ role in
> > design discussions.  I am of the opinion that it does not, and
> > therefore I reäfirrm the veto.
>
> It has become more clear to me (I was only following tangentially)
> that your veto is focused on the development methodology and the lack
> of design discussion. Is that a valid reason for a veto? We are low on
> resources, someone still finds time to make some progress, no one
> blocks it on technical grounds, and then someone vetoes it because we
> don't have enough resources?
>
> That puts us pretty much in deadlock, because we are too low on
> resources. Or maybe I misunderstand?
>
> To be clear: I appreciate your input, Daniel, and your insistence on a
> more thorough design discussion. I assume it's coming from a genuine
> concern that we formulate problems well, and think hard about possible
> solutions (focusing on the precise problem we are trying to solve).
> But at the end of the day, if that design discussion doesn't happen
> (or not enough to your satisfaction anyway), is that grounds for a
> veto? For me it's a tough call, because on the one hand you have a
> point, but on the other hand ... you're blocking _some_ progress
> because the process behind it is not perfect (which is hard to do with
> the 3.25 tuits we have left).
>
> > P.S.  Could that BRANCH-README please state what's the problem the branch
> > means to solve, i.e., the goal / acceptance test?  "Make it possible to
> > «svn add» SHA-1 collisions"?
>
> I agree that would be a good step.
>
> I too find it a bit unclear what problem we're actually trying to
> solve, apart from a vague feeling that SHA-1 will become more and more
> broken over time, and that this will cause fatal injury to SVN (in its
> WC, protocol, dump format, or repository). And perhaps the fact that
> security auditors are becoming more and more triggered by seeing SHA-1
> (even if they don't understand the way it is used and its
> ramifications). Making it possible to 'svn add' SHA-1 collisions is
> not it, I think.
>

I also agree with this.

From what I remember of the dicsussions earlier there were concerns that a
changed file might go undetected if someone change it to another file with
a collision with the original file. I think that might be a vaild point,
especially if we don't have the pristine files anymore.

I'd also like to understand why we need the multi-checksum format instead
of just plainly switching to XXX (insert favourite checksuming algorithm
here). Does it help us to have multiple types of checksums available? Would
we use BOTH as a resort (likelyhood of collision in SHA1 and in XXX at the
same time approaching zero)? Does it help backwards/forwards compatibility?

Kind regards,
Daniel Sahlberg

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Johan Corveleyn <jc...@gmail.com>.

On Fri, Jan 12, 2024 at 12:37 PM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
...
> Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
> I had the context in our heads, and the cache misses took their toll in
> tuits and in wallclock time.  Furthermore, I have less spare time for
> dev@ discussions than I did when I cast the veto (= a year ago next
> Saturday).  Going forward it might be preferable for threads not to
> hibernate.

I agree, but obviously the hibernation is not some deliberate action
by anyone. It's just that most of us here have less spare time for
dev@ discussions (and for SVN development) than before. Especially for
such complex matters, and especially when people feel there are
walking into a minefield. There are only a few active devs left, and
tuits are running low ...

...
> That being the case, I have considered whether merging the feature
> branch outweighs letting dev@ take a not-only-/pro forma/ role in
> design discussions.  I am of the opinion that it does not, and
> therefore I reäfirrm the veto.

It has become more clear to me (I was only following tangentially)
that your veto is focused on the development methodology and the lack
of design discussion. Is that a valid reason for a veto? We are low on
resources, someone still finds time to make some progress, no one
blocks it on technical grounds, and then someone vetoes it because we
don't have enough resources?

That puts us pretty much in deadlock, because we are too low on
resources. Or maybe I misunderstand?

To be clear: I appreciate your input, Daniel, and your insistence on a
more thorough design discussion. I assume it's coming from a genuine
concern that we formulate problems well, and think hard about possible
solutions (focusing on the precise problem we are trying to solve).
But at the end of the day, if that design discussion doesn't happen
(or not enough to your satisfaction anyway), is that grounds for a
veto? For me it's a tough call, because on the one hand you have a
point, but on the other hand ... you're blocking _some_ progress
because the process behind it is not perfect (which is hard to do with
the 3.25 tuits we have left).

> P.S.  Could that BRANCH-README please state what's the problem the branch
> means to solve, i.e., the goal / acceptance test?  "Make it possible to
> «svn add» SHA-1 collisions"?

I agree that would be a good step.

I too find it a bit unclear what problem we're actually trying to
solve, apart from a vague feeling that SHA-1 will become more and more
broken over time, and that this will cause fatal injury to SVN (in its
WC, protocol, dump format, or repository). And perhaps the fact that
security auditors are becoming more and more triggered by seeing SHA-1
(even if they don't understand the way it is used and its
ramifications). Making it possible to 'svn add' SHA-1 collisions is
not it, I think.

-- 
Johan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Thu, Feb 1, 2024 at 5:26 PM Daniel Sahlberg
<da...@gmail.com> wrote:
>
> Gentlemen,
>
> It seems you have both had your say in what flaws there has been in the process. Can we please leave this part of the discussion and continue on the technical issues? I'd hate for this discussion to turn to pie-throwing where someone in the end feel offended and leave the community. We are such a small community and we can't afford to lose someone just because an argument turns toxic (it has happened before so let's make sure it doesn't happen again, please).

I completely agree. Yes, there has been disagreement about process,
but it is counterproductive to debate that anymore. Let's focus on the
technical question and try to reach some consensus on what (if
anything) to do.

> As for the technical side, can we break down the current status and the desired future status to some points and then look at what options we have for solutions?
>
> Currently we use SHA1, which have known attacks. What are the risks?
> - It has been argued that `svn st` will, especially with no-pristines, be extra vulnerable to not detecting a modified file if someone can create a collision with the checksum of the original file
> - Someone also argued that a software could potentially be banned just because it uses a checksum with a known attack, even if the checksum isn't used in a security critical way.

I was the one who spoke about that possibility.

Just one example: NIST has already recommended federal agencies to
stop using SHA-1 for "signatures and other operations threatened by
collision attacks" and by 31 Dec 2030 NIST will publish "a revision of
FIPS 180 that removes the SHA-1 specification" and "Modules that still
use SHA-1 after 2030 will not be permitted for purchase by the federal
government." All those quotes are taken from [1], which was one of the
top hits in a recent DuckDuckGo search. (I don't remember the exact
search.)

Now, even if SVN's use cases of SHA1 are agreed by the developers to
be completely safe, I think it is a real possibility that some sites
could ban SVN because they consider SHA1 a banned algorithm, and even
if we explain that SVN's use of SHA1 is completely safe, those
explanations might not be acceptable in those settings, even if we are
right.

Given the way technology is used, understood, and sometimes (often?)
misunderstood, I can imagine a ridiculous scenario in which Subversion
could use 8-bit CRC, but not SHA1, even though SHA1 is much stronger
than 8-bit CRC, just because SHA1 is "banned" and 8-bit CRC is not.

> What options do we have and how do they mitigate the above risks?> - Evgeny has already shown a possible solution with a salted hash (keeping SHA-1).
> - Can we switch to another hash function completely and does it offer any benefits compared to the salted SHA-1?
> - Should we even do both?
>
> Any other points?
>
> Any thoughts?
>
> I would like to see this thread progress and I hope we can find consensus on a way forward.
>
> Kind regards,
> Daniel Sahlberg

I, too, hope the community can come together and reach a consensus,
whatever that ends up being.

[1] https://www.securityweek.com/nist-retire-27-year-old-sha-1-cryptographic-algorithm/

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

Gentlemen,

It seems you have both had your say in what flaws there has been in the
process. Can we please leave this part of the discussion and continue on
the technical issues? I'd hate for this discussion to turn to pie-throwing
where someone in the end feel offended and leave the community. We are such
a small community and we can't afford to lose someone just because an
argument turns toxic (it has happened before so let's make sure it doesn't
happen again, please).

As for the technical side, can we break down the current status and the
desired future status to some points and then look at what options we have
for solutions?

Currently we use SHA1, which have known attacks. What are the risks?
- It has been argued that `svn st` will, especially with no-pristines, be
extra vulnerable to not detecting a modified file if someone can create a
collision with the checksum of the original file
- Someone also argued that a software could potentially be banned just
because it uses a checksum with a known attack, even if the checksum isn't
used in a security critical way.

What options do we have and how do they mitigate the above risks?
- Evgeny has already shown a possible solution with a salted hash (keeping
SHA-1).
- Can we switch to another hash function completely and does it offer any
benefits compared to the salted SHA-1?
- Should we even do both?

Any other points?

Any thoughts?

I would like to see this thread progress and I hope we can find consensus
on a way forward.

Kind regards,
Daniel Sahlberg


Den tors 18 jan. 2024 kl 14:36 skrev Evgeny Kotkov via dev <
dev@subversion.apache.org>:

> Daniel Shahaf <d....@daniel.shahaf.name> writes:
>
> > Procedurally, the long hiatus is counterproductive.
>
> This reminds me that the substantive discussion of your veto ended with my
> email from 8 Feb 2023 that had four direct questions to you and was left
> without an answer:
>
> ``````
>   > That's not how design discussions work.  A design discussion doesn't go
>   > "state decision; state pros; implement"; it goes "state problem;
> discuss
>   > potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).
>
>   Well, I think it may not be as simple as it seems to you.  Who decided
> that
>   we should follow the process you're describing?  Is there a thread with a
>   consensus on this topic?  Or do you insist on using this specific process
>   because it's the only process that seems obvious to you?  What
> alternatives
>   to it have been considered?
>
>   As far as I can tell, the process you're suggesting is effectively a
>   waterfall-like process, and there are quite a lot of concerns about its
>   effectiveness, because the decisions have to be made in the conditions of
>   a lack of information.
> ``````
>
> It's been more than 11 months since that email, and those questions still
> don't have an answer.  So if we are to resume this discussion, let's do it
> from the proper point.
>
> > You guys are welcome to try to /convince/ me to change my opinion, or to
> > have the veto invalidated.  In either case, you will be more likely to
> > succeed should your arguments relate not only to the veto's implications
> > but also to its /sine qua non/ component: its rationale.
>
> Just in case, my personal opinion here is that the veto is invalid.
>
> Firstly, based on my understanding, the ASF rules prohibit casting a veto
> without an appropriate technical justification (see [1], which I personally
> agree with).  Secondly, it seems that the process you are imposing hasn't
> been
> accepted in this community.  As far as I know, this topic was tangentially
> discussed before (see [2], for example), and it looks like there hasn't
> been
> a consensus to change our current Commit-Then-Review process into some
> sort of Review-Then-Commit.
>
> (At the same time I won't even try to /convince/ you, sorry.)
>
> [1] https://www.apache.org/foundation/voting.html
> [2] https://lists.apache.org/thread/ow2x68g2k4lv2ycr81d14p8r8w2jj1xl
>
>
> Regards,
> Evgeny Kotkov
>

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> Procedurally, the long hiatus is counterproductive.

This reminds me that the substantive discussion of your veto ended with my
email from 8 Feb 2023 that had four direct questions to you and was left
without an answer:

``````
  > That's not how design discussions work.  A design discussion doesn't go
  > "state decision; state pros; implement"; it goes "state problem; discuss
  > potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

  Well, I think it may not be as simple as it seems to you.  Who decided that
  we should follow the process you're describing?  Is there a thread with a
  consensus on this topic?  Or do you insist on using this specific process
  because it's the only process that seems obvious to you?  What alternatives
  to it have been considered?

  As far as I can tell, the process you're suggesting is effectively a
  waterfall-like process, and there are quite a lot of concerns about its
  effectiveness, because the decisions have to be made in the conditions of
  a lack of information.
``````

It's been more than 11 months since that email, and those questions still
don't have an answer.  So if we are to resume this discussion, let's do it
from the proper point.

> You guys are welcome to try to /convince/ me to change my opinion, or to
> have the veto invalidated.  In either case, you will be more likely to
> succeed should your arguments relate not only to the veto's implications
> but also to its /sine qua non/ component: its rationale.

Just in case, my personal opinion here is that the veto is invalid.

Firstly, based on my understanding, the ASF rules prohibit casting a veto
without an appropriate technical justification (see [1], which I personally
agree with).  Secondly, it seems that the process you are imposing hasn't been
accepted in this community.  As far as I know, this topic was tangentially
discussed before (see [2], for example), and it looks like there hasn't been
a consensus to change our current Commit-Then-Review process into some
sort of Review-Then-Commit.

(At the same time I won't even try to /convince/ you, sorry.)

[1] https://www.apache.org/foundation/voting.html
[2] https://lists.apache.org/thread/ow2x68g2k4lv2ycr81d14p8r8w2jj1xl

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Wed, 03 Jan 2024 22:13 +00:00:
> On 01 Apr 2023, Evgeny Kotkov via dev wrote:
> > Daniel Shahaf <d....@daniel.shahaf.name> writes:
> > 
> > > What's the question or action item to/for me?  Thanks.
> > 
> > I'm afraid I don't fully understand your question.  As you
> > probably remember, the change is blocked by your veto.  To my
> > knowledge, this veto hasn't been revoked as of now, and I simply
> > mentioned that in my email.  It is entirely your decision
> > whether or not to take any action regarding this matter.
> 
> So AIUI, Evgeny is asking you to withdraw your veto, Daniel. Evgeny would
> like to merge this into trunk -- on the grounds, I believe, that it is
> strictly an improvement over what we have now, and it opens the door to
> further future improvements (each of which would go through the usual
> discussion & consensus process, of course).

So, I looked.

This thread comprises 237 posts spanning 30 months (July 2021 through
today).  On 2023-01-20 I cast a veto.  There was some activity
afterwards, but until the parent post of this one, the thread has been
silent for the better part of a year; and now I'm being asked to
withdraw my veto.

Procedurally, the long hiatus is counterproductive.  Neither kfogel nor
I had the context in our heads, and the cache misses took their toll in
tuits and in wallclock time.  Furthermore, I have less spare time for
dev@ discussions than I did when I cast the veto (= a year ago next
Saturday).  Going forward it might be preferable for threads not to
hibernate.

You didn't link the veto, so I had to go grep for it.  It is,
presumably, this one:

>>>> # Archived-At: https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C904aded6-5ef0-4123-ade0-e23a3bb56726%40app.fastmail.com%3E
>>>> Date: Fri, 20 Jan 2023 12:15:24 +0000
>>>> From: Daniel Shahaf
>>>> To: dev@subversion.apache.org
>>>> Subject: Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format
>>>> Message-Id: <90...@app.fastmail.com>
>>>> 
>>>> Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>>>> > I can complete the work on this branch and bring it to a production-ready
>>>> > state, assuming there are no objections.
>>>> 
>>>> Your assumption is counterfactual:
>>>> 
>>>> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>>>> 
>>>> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>>>> 
>>>> Objections have been raised, been left unanswered, and now
>>>> implementation work has commenced following the original design.  That's
>>>> not acceptable.  I'm vetoing the change until a non-rubber-stamp design
>>>> discussion has been completed on the public dev@ list.

So, this veto being in front of me, let me reply to the request that
I withdraw it:

> So AIUI, Evgeny is asking you to withdraw your veto, Daniel. Evgeny would
> like to merge this into trunk -- on the grounds, I believe, that it is
> strictly an improvement over what we have now, and it opens the door to
> further future improvements (each of which would go through the usual
> discussion & consensus process, of course).
> 
> Evgeny's work is on this branch...
> 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
> 
> ...which in turn branched from
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
> 
> I used this command to get an overview of the work:
> 
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README

As far as I can tell, the request for veto withdrawal is grounded only
in the fact that the veto, whilst in force, prevents the feature branch
from being merged/released.  The request does not allege the veto was
invalid or unfounded in the first place; nor that the veto has /become/
invalid or unfounded due to time having passed; nor that modifications
or alterations to the code [or, in this case, to the decision-making
process] have been made and are believed to have addressed the veto's
grounds.

In summary, the request only deals with the fact of a veto and its
formal/procedural implications, but does not deal with the substantive
justification for the veto at all.

That being the case, I have no reason to believe the original grounds of
the veto have been addressed.

That being the case, I have considered whether merging the feature
branch outweighs letting dev@ take a not-only-/pro forma/ role in
design discussions.  I am of the opinion that it does not, and
therefore I reäfirrm the veto.

You guys are welcome to try to /convince/ me to change my opinion, or to
have the veto invalidated.  In either case, you will be more likely to
succeed should your arguments relate not only to the veto's implications
but also to its /sine qua non/ component: its rationale.

Before I salutate this post, I wish to point out that it's rather
ironic — or perhaps I should say /alarming/ — that the request for veto
withdrawal does not deal with the substantive grounds for the veto,
considering those grounds were "dev@ isn't being listened to".  In fact,
this is so inconsistent with the past 15+ years of kfogel interactions
that I feel I should ask whoever happens to live closest to kfogel's if
they would be so very kind as to pop over there, knock on the front
door, and tell him his email is being impersonated.  (Naturally, make
sure it's actually him at the door, first. :P)

Cheers,

Daniel

P.S.  Could that BRANCH-README please state what's the problem the branch
means to solve, i.e., the goal / acceptance test?  "Make it possible to
«svn add» SHA-1 collisions"?

> Evgeny's work is on this branch...
> 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
> 
> ...which in turn branched from
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
> 
> I used this command to get an overview of the work:
> 
> $ svn cat https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README
> 
> (The work is several months old now, but for the sake of discussion let's
> assume it's mergeable, passes all tests, etc. Obviously, Evgeny's only going
> to merge it when all of those conditions are true -- maybe some minor tweaks
> will be needed to get it there, I don't know.)

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 04 Jan 2024, Daniel Shahaf wrote:
>Acknowledging receipt.  I'll reply substantively when I have the 
>time to swap in the context.

Thanks.  Yeah, I went through the same context-swapping-in process 
yesterday before posting!

Best regards,
-Karl

>> Evgeny's work is on this branch...
>>
>> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
>>
>> ...which in turn branched from 
>> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
>>
>> I used this command to get an overview of the work:
>>
>> $ svn cat 
>> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README
>>
>> (The work is several months old now, but for the sake of 
>> discussion let's assume it's mergeable, passes all tests, etc. 
>> Obviously, Evgeny's only going to merge it when all of those 
>> conditions are true -- maybe some minor tweaks will be needed 
>> to 
>> get it there, I don't know.)
>>
>> Best regards,
>> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Wed, 03 Jan 2024 22:13 +00:00:
> On 01 Apr 2023, Evgeny Kotkov via dev wrote:
>>Daniel Shahaf <d....@daniel.shahaf.name> writes:
>>
>>> What's the question or action item to/for me?  Thanks.
>>
>>I'm afraid I don't fully understand your question.  As you
>>probably remember, the change is blocked by your veto.  To my
>>knowledge, this veto hasn't been revoked as of now, and I simply
>>mentioned that in my email.  It is entirely your decision
>>whether or not to take any action regarding this matter.
>
> So AIUI, Evgeny is asking you to withdraw your veto, Daniel. 
> Evgeny would like to merge this into trunk -- on the grounds, I 
> believe, that it is strictly an improvement over what we have now, 
> and it opens the door to further future improvements (each of 
> which would go through the usual discussion & consensus process, 
> of course).
>

Acknowledging receipt.  I'll reply substantively when I have the time to swap in the context.

Daniel

> Evgeny's work is on this branch...
>
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt
>
> ...which in turn branched from 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.
>
> I used this command to get an overview of the work:
>
> $ svn cat 
> https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README
>
> (The work is several months old now, but for the sake of 
> discussion let's assume it's mergeable, passes all tests, etc. 
> Obviously, Evgeny's only going to merge it when all of those 
> conditions are true -- maybe some minor tweaks will be needed to 
> get it there, I don't know.)
>
> Best regards,
> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 01 Apr 2023, Evgeny Kotkov via dev wrote:
>Daniel Shahaf <d....@daniel.shahaf.name> writes:
>
>> What's the question or action item to/for me?  Thanks.
>
>I'm afraid I don't fully understand your question.  As you
>probably remember, the change is blocked by your veto.  To my
>knowledge, this veto hasn't been revoked as of now, and I simply
>mentioned that in my email.  It is entirely your decision
>whether or not to take any action regarding this matter.

So AIUI, Evgeny is asking you to withdraw your veto, Daniel. 
Evgeny would like to merge this into trunk -- on the grounds, I 
believe, that it is strictly an improvement over what we have now, 
and it opens the door to further future improvements (each of 
which would go through the usual discussion & consensus process, 
of course).

Evgeny's work is on this branch...

https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt

...which in turn branched from 
https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind.

I used this command to get an overview of the work:

$ svn cat 
https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt/BRANCH-README

(The work is several months old now, but for the sake of 
discussion let's assume it's mergeable, passes all tests, etc. 
Obviously, Evgeny's only going to merge it when all of those 
conditions are true -- maybe some minor tweaks will be needed to 
get it there, I don't know.)

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> What's the question or action item to/for me?  Thanks.

I'm afraid I don't fully understand your question.  As you probably remember,
the change is blocked by your veto.  To my knowledge, this veto hasn't been
revoked as of now, and I simply mentioned that in my email.  It is entirely
your decision whether or not to take any action regarding this matter.


Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Wed, 22 Mar 2023 15:23 +00:00:
> This change is still being blocked by a veto, but if danielsh changes his
> mind and if there won't be other objections, I'm ready to complete the few
> remaining bits and merge it to trunk.

What's the question or action item to/for me?  Thanks.

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> > Now, how hard would this be to actually implement?
>
> To have a more or less accurate estimate, I went ahead and prepared the
> first-cut implementation of an approach that makes the pristine checksum
> kind configurable in a working copy.
>
> The current implementation passes all tests in my environment and seems to
> work in practice.  It is available on the branch:
>
>   https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind
>
> The implementation on the branch allows creating working copies that use a
> checksum kind other than SHA-1.

I extended the current implementation to use a dynamically salted SHA-1
checksum, rather than a SHA-1 with a statically hardcoded salt.
The dynamic salt is generated during the creation of a wc.db.

The implementation is available on a separate branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt

The change is a bit massive, but in the meantime I think that it should solve
the potential problem without any practical drawbacks, except for the lack
of the mentioned ra_serf fetch optimization.

So overall I'd propose to bring this change to trunk, to improve the current
state around checksum collisions in the working copy, and to also have the
infrastructure for supporting different checksum kinds in place, in case
we need it in the future.

This change is still being blocked by a veto, but if danielsh changes his
mind and if there won't be other objections, I'm ready to complete the few
remaining bits and merge it to trunk.

Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

To be clear, I wasn't vetoing changing the hash algorithm.  I was
vetoing making a change without discussion.  If there is discussion and
it results in consensus to change the algorithm, that'll be absolutely
fine by me.

Daniel

Karl Fogel wrote on Sat, 21 Jan 2023 17:58 +00:00:
> *nod* This issue isn't important enough to me to continue the 
> conversation -- I'd like for new hash algorithms to be possible, 
> and I think Evgeny's work on it is worthwhile, but I don't feel 
> nearly as strongly about this as I feel about making the new 
> pristineless working copies available in an official release as 
> soon as we can.
>
> Best regards,
> -Karl
>
> On 21 Jan 2023, Daniel Shahaf wrote:
>>Karl Fogel wrote on Fri, Jan 20, 2023 at 11:09:11 -0600:
>>> On 20 Jan 2023, Daniel Shahaf wrote:
>>> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>>> > > I can complete the work on this branch and bring it to a
>>> > > production-ready
>>> > > state, assuming there are no objections.
>>> > 
>>> > Your assumption is counterfactual:
>>> > 
>>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>>> > 
>>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>>> > 
>>> > Objections have been raised, been left unanswered, and now
>>> > implementation work has commenced following the original 
>>> > design. That's
>>> > not acceptable.
>>> 
>>> I'm a little surprised by your reaction.
>>> 
>>> It is never "not acceptable" for someone to do implementation 
>>> work on a
>>> branch while a discussion is happening, even if that discussion 
>>> contains
>>> objections to or questions about the premise of the branch 
>>> work.
>>> 
>>> It's a branch.  He didn't merge it to trunk, and he posted it 
>>> as an explicit
>>> invitation for discussion.
>>> 
>>
>>I didn't object to the use of a branch /per se/.  I objected to 
>>the
>>treating of objections that *had already been posted* as though 
>>they had
>>never been posted.  *That's* not acceptable.
>>
>>However, since you ask, I don't think implementing a proposal on
>>a branch is necessarily a good idea:
>>
>>- If the branch is seen and presented as a PoC for furthering 
>>discussion
>>  and for discovering practical considerations (e.g., that
>>  PRISTINE.MD5_CHECKSUM docstring I found yesterday during 
>>  discussion,
>>  or the ra_serf sha1 optimization that anyone implementing the 
>>  branch
>>  would run into), it's likely a good thing.
>>  
>>- On the other hand, when the branch implements the original 
>>proposal,
>>  whilst outstanding questions were not only not answered but 
>>  also not
>>  acknowledged, that's quite another thing.  It can result in:
>>
>>  + The branch maintainer being biased in favour of the approach 
>>  they
>>    have implemented.  (People tend not to argue against what 
>>    they have
>>    expended resources on.  Cf. plan continuation bias, sunk cost
>>    fallacy.)
>>
>>  + dev@ being biased towards the approach that has been 
>>  implemented
>>    (because it's a known entity; because no one is volunteering 
>>    to
>>    implement another approach; because there's a desire to cut
>>    a minor release soon…).  This, in turn, can result in…
>>  
>>  + …an incentive for participants *not* to hold open design
>>    discussions on dev@ in the first place.
>>
>>> > I'm vetoing the change until a non-rubber-stamp design
>>> > discussion has been completed on the public dev@ list.
>>> 
>>> Starting an implementation on a branch is a valuable 
>>> contribution to a
>>> design discussion -- it's exactly the kind of 
>>> "non-rubber-stamp"
>>> contribution one would want.
>>> 
>>
>>You're just repeating what you said above.
>>
>>> If you want to re-iterate points you've made that have been 
>>> left unanswered,
>>> that would be a useful contribution -- perhaps some of those 
>>> points will be
>>> updated now that there's actual code, or perhaps they won't. 
>>> Either way,
>>> what Evgeny is doing here seems very constructive to me, and 
>>> entirely within
>>> the normal range of how we do things.
>>
>>Posting a paragraph such as the one I'm replying to is not 
>>"entirely
>>within the normal range of how we do things".  As to my points, 
>>see
>><https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E>.
>>They boil down to this:
>>
>>    <alice> We should migrate away from SHA-1.
>>    <bob> Why?
>>
>>Daniel
>>
>>> Best regards,
>>> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

*nod* This issue isn't important enough to me to continue the 
conversation -- I'd like for new hash algorithms to be possible, 
and I think Evgeny's work on it is worthwhile, but I don't feel 
nearly as strongly about this as I feel about making the new 
pristineless working copies available in an official release as 
soon as we can.

Best regards,
-Karl

On 21 Jan 2023, Daniel Shahaf wrote:
>Karl Fogel wrote on Fri, Jan 20, 2023 at 11:09:11 -0600:
>> On 20 Jan 2023, Daniel Shahaf wrote:
>> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>> > > I can complete the work on this branch and bring it to a
>> > > production-ready
>> > > state, assuming there are no objections.
>> > 
>> > Your assumption is counterfactual:
>> > 
>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>> > 
>> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>> > 
>> > Objections have been raised, been left unanswered, and now
>> > implementation work has commenced following the original 
>> > design. That's
>> > not acceptable.
>> 
>> I'm a little surprised by your reaction.
>> 
>> It is never "not acceptable" for someone to do implementation 
>> work on a
>> branch while a discussion is happening, even if that discussion 
>> contains
>> objections to or questions about the premise of the branch 
>> work.
>> 
>> It's a branch.  He didn't merge it to trunk, and he posted it 
>> as an explicit
>> invitation for discussion.
>> 
>
>I didn't object to the use of a branch /per se/.  I objected to 
>the
>treating of objections that *had already been posted* as though 
>they had
>never been posted.  *That's* not acceptable.
>
>However, since you ask, I don't think implementing a proposal on
>a branch is necessarily a good idea:
>
>- If the branch is seen and presented as a PoC for furthering 
>discussion
>  and for discovering practical considerations (e.g., that
>  PRISTINE.MD5_CHECKSUM docstring I found yesterday during 
>  discussion,
>  or the ra_serf sha1 optimization that anyone implementing the 
>  branch
>  would run into), it's likely a good thing.
>  
>- On the other hand, when the branch implements the original 
>proposal,
>  whilst outstanding questions were not only not answered but 
>  also not
>  acknowledged, that's quite another thing.  It can result in:
>
>  + The branch maintainer being biased in favour of the approach 
>  they
>    have implemented.  (People tend not to argue against what 
>    they have
>    expended resources on.  Cf. plan continuation bias, sunk cost
>    fallacy.)
>
>  + dev@ being biased towards the approach that has been 
>  implemented
>    (because it's a known entity; because no one is volunteering 
>    to
>    implement another approach; because there's a desire to cut
>    a minor release soon…).  This, in turn, can result in…
>  
>  + …an incentive for participants *not* to hold open design
>    discussions on dev@ in the first place.
>
>> > I'm vetoing the change until a non-rubber-stamp design
>> > discussion has been completed on the public dev@ list.
>> 
>> Starting an implementation on a branch is a valuable 
>> contribution to a
>> design discussion -- it's exactly the kind of 
>> "non-rubber-stamp"
>> contribution one would want.
>> 
>
>You're just repeating what you said above.
>
>> If you want to re-iterate points you've made that have been 
>> left unanswered,
>> that would be a useful contribution -- perhaps some of those 
>> points will be
>> updated now that there's actual code, or perhaps they won't. 
>> Either way,
>> what Evgeny is doing here seems very constructive to me, and 
>> entirely within
>> the normal range of how we do things.
>
>Posting a paragraph such as the one I'm replying to is not 
>"entirely
>within the normal range of how we do things".  As to my points, 
>see
><https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E>.
>They boil down to this:
>
>    <alice> We should migrate away from SHA-1.
>    <bob> Why?
>
>Daniel
>
>> Best regards,
>> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Fri, Jan 20, 2023 at 11:09:11 -0600:
> On 20 Jan 2023, Daniel Shahaf wrote:
> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> > > I can complete the work on this branch and bring it to a
> > > production-ready
> > > state, assuming there are no objections.
> > 
> > Your assumption is counterfactual:
> > 
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> > 
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> > 
> > Objections have been raised, been left unanswered, and now
> > implementation work has commenced following the original design. That's
> > not acceptable.
> 
> I'm a little surprised by your reaction.
> 
> It is never "not acceptable" for someone to do implementation work on a
> branch while a discussion is happening, even if that discussion contains
> objections to or questions about the premise of the branch work.
> 
> It's a branch.  He didn't merge it to trunk, and he posted it as an explicit
> invitation for discussion.
> 

I didn't object to the use of a branch /per se/.  I objected to the
treating of objections that *had already been posted* as though they had
never been posted.  *That's* not acceptable.

However, since you ask, I don't think implementing a proposal on
a branch is necessarily a good idea:

- If the branch is seen and presented as a PoC for furthering discussion
  and for discovering practical considerations (e.g., that
  PRISTINE.MD5_CHECKSUM docstring I found yesterday during discussion,
  or the ra_serf sha1 optimization that anyone implementing the branch
  would run into), it's likely a good thing.
  
- On the other hand, when the branch implements the original proposal,
  whilst outstanding questions were not only not answered but also not
  acknowledged, that's quite another thing.  It can result in:

  + The branch maintainer being biased in favour of the approach they
    have implemented.  (People tend not to argue against what they have
    expended resources on.  Cf. plan continuation bias, sunk cost
    fallacy.)

  + dev@ being biased towards the approach that has been implemented
    (because it's a known entity; because no one is volunteering to
    implement another approach; because there's a desire to cut
    a minor release soon…).  This, in turn, can result in…
  
  + …an incentive for participants *not* to hold open design
    discussions on dev@ in the first place.

> > I'm vetoing the change until a non-rubber-stamp design
> > discussion has been completed on the public dev@ list.
> 
> Starting an implementation on a branch is a valuable contribution to a
> design discussion -- it's exactly the kind of "non-rubber-stamp"
> contribution one would want.
> 

You're just repeating what you said above.

> If you want to re-iterate points you've made that have been left unanswered,
> that would be a useful contribution -- perhaps some of those points will be
> updated now that there's actual code, or perhaps they won't.  Either way,
> what Evgeny is doing here seems very constructive to me, and entirely within
> the normal range of how we do things.

Posting a paragraph such as the one I'm replying to is not "entirely
within the normal range of how we do things".  As to my points, see
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E>.
They boil down to this:

    <alice> We should migrate away from SHA-1.
    <bob> Why?

Daniel

> Best regards,
> -Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> > That could happen after a public disclosure of a pair of executable
> > files/scripts where the forged version allows for remote code execution.
> > Or maybe something similar with a file format that is often stored in
> > repositories and that can be executed or used by a build script, etc.
> >
>
> Err, hang on.  Your reference described a chosen-prefix attack, while
> this scenario concerns a single public collision.  These are two
> different things.

A chosen-prefix attack allows finding more meaningful collisions such as
working executables/scripts.  When such collisions are made public, they
would have a greater exploitation potential than just a random collision.

> Disclosure of of a pair of executable files/scripts isn't by itself
> a problem unless one of the pair ("file A") is in a repository
> somewhere.  Now, was the colliding file ("file B") generated _before_ or
> _after_ file A was committed?
>
> - If _before_, then it would seem Mallory had somehow managed to:
>
>   1. get a file of his choosing committed to Alice's repository; and
>
>   2. get a wc of Alice's repository into one of the codepaths that
>      assume SHA-1 is one-to-one / collission-free (currently that's the
>      ra_serf optimization and the 1.15 wc status).

Not only.  There are cases when the working copy itself installs the working
file with a hash lookup in the pristine store.  This is more true for 1.14
than trunk, because in trunk we have the streamy checkout/update that avoid
such lookups by writing straight to the working file.  However, some of
the code paths still install the contents from the pristine store by hash.
Examples include reverting a file, copying an unmodified file, switching
a file with keywords, the mentioned ra_serf optimization, and etc.

>   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
>   me how file B would reach the wc in step #2…

If Mallory has write access, she could commit both files, thus arranging for
a possible content change if both files are checked out to a single working
copy.  This isn't the same as just directly modifying the target file, because
file content isn't expected to change due to changes in other files (that can
be of any type), so this attack has much better chances of being unnoticed.

If Mallory doesn't have write access, there should be other vectors, such
as distributing a pair of files (harmless in the context of their respective
file formats) separately via two upstream channels.  Then, if both of the
upstream distributions are committed into a repository and their files are
checked out together, the content will change, allowing for a malicious
action.

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> Look, it's pretty simple.  You said "We should do Y because it
> addresses X".  You didn't explain why X needs to be addressed, didn't
> consider what alternatives there are to Y, didn't consider any cons that
> Y may have… and when people had questions, you just began to
> implement Y, without responding to or even acknowledging those
> questions.
>
> That's not how design discussions work.  A design discussion doesn't go
> "state decision; state pros; implement"; it goes "state problem; discuss
> potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

Well, I think it may not be as simple as it seems to you.  Who decided that
we should follow the process you're describing?  Is there a thread with a
consensus on this topic?  Or do you insist on using this specific process
because it's the only process that seems obvious to you?  What alternatives
to it have been considered?

As far as I can tell, the process you're suggesting is effectively a
waterfall-like process, and there are quite a lot of concerns about its
effectiveness, because the decisions have to be made in the conditions of
a lack of information.

Personally, I prefer an alternative process that starts from finding out
all available bits of information, which are then used to make informed
decisions.  The unfortunate reality, however, is that the only guaranteed
way of collecting all information means implementing all (or almost all)
significant parts in code.  Roughly speaking, this process looks like a
research project that gets completed by trial and error.

Based on what you've been saying so far, I wouldn't be surprised if you
disagree.  But I still think that forcing the others to follow a certain
process by such means as vetoing a code change is maybe a bit over the
top.  (In the meantime, I certainly won't object if you're going to use this
waterfall-like process for the changes that you implement yourself.)

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Sun, Jan 29, 2023 at 16:37:20 +0300:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > > (I'm not saying that the above rules have to be used in this particular case
> > >  and that a veto is invalid, but still thought it’s worth mentioning.)
> > >
> >
> > I vetoed the change because it hadn't been designed on the dev@ list,
> > had not garnered dev@'s consensus, and was being railroaded through.
> > (as far as I could tell)
> 
> I have *absolutely* no idea where "being railroaded through" comes from.
> Really, it's a wrong way of portraying and thinking about the events that have
> happened so far.
> 
> Reiterating over those events: I wrote an email containing my thoughts
> and explaining the motivation for such change.  I didn't reply to some of
> the questions (including some tricky questions, such as the one featuring
> a theoretical hash function), because they have been at least partly
> answered by others in the thread, and I didn't have anything valuable
> to add at that time.
> 
> During that time, I was actively coding the core part of the change,
> to check if it's possible technically.  Which is important, as far as
> I believe, because not all theoretically possible solutions can be implemented
> without facing significant practical or implementation-related issues, and
> it seems to me that you significantly undervalue such an approach.
> 

Quoting myself from elsethread: [3]

    - If the branch is seen and presented as a PoC for furthering discussion
      and for discovering practical considerations (e.g., that
      PRISTINE.MD5_CHECKSUM docstring I found yesterday during discussion,
      or the ra_serf sha1 optimization that anyone implementing the branch
      would run into), it's likely a good thing.

> I do not say my actions were exemplary, but as far as I can tell, they're
> pretty much in line with how svn-dev has been operating so far.  But, it all
> resulted in an unclear veto without any _technical_ arguments, where what's
> being vetoed is unclear as well, because the change was not ready at the
> moment veto got casted.
> 

Look, it's pretty simple.  You said "We should do Y because it
addresses X".  You didn't explain why X needs to be addressed, didn't
consider what alternatives there are to Y, didn't consider any cons that
Y may have… and when people had questions, you just began to
implement Y, without responding to or even acknowledging those
questions.

That's not how design discussions work.  A design discussion doesn't go
"state decision; state pros; implement"; it goes "state problem; discuss
potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

That's why I called veto: not because I considered any particular
proposal then on the table unreasonable, but because I considered /the
decision process being used/ unreasonable (cf. [7]).

> And because your veto goes in favor of a specific process

Yes, I'm arguing in favour of first defining a problem, then considering
solutions to it, both their pros and cons, and only then deciding what
to implement.  This process isn't unique, novel, or singular; it's
standard in multiple disciplines [4–7].

>                                                           (considering that
> no other arguments were given), the only thing that's *actually* being
> railroaded is an odd form of an RTC (review-then-commit) process that is
> against our usual CTR (commit-then-review) [1,2].  That's railroading,
> because it hasn't been explicitly discussed anywhere and a consensus
> on it has not been reached.

This thread was started on 2022-12-20 [1], with the idiomatic
"Thoughts?" sign-off.  The first relevant code was committed on
2023-01-19 [2].

That is: the change followed RTC to begin with.  Considering that both
[1] and [2] were authored by you personally, I find it difficult to
charitably interpret your claim that "an odd form of [RTC]" was being
"railroaded", as RTC rather than "our usual CTR [process]" was being
followed at your own decision.

It's perhaps worth pointing out the veto followed the branch creation
because that was the point when I gave up on waiting for someone to
respond to the objections that had been made by then.  It wasn't a veto
on using a branch, as I have clarified: [3]

    I didn't object to the use of a branch /per se/.  I objected to the
    treating of objections that *had already been posted* as though they had
    never been posted.  *That's* not acceptable.

So, no, I wasn't advocating /either/ RTC or CTR; I was advocating that
the "R" step happen at all.  A branch may take place before, during, or
after discussion — see [3] for more — but the important thing is that
discussion happen.  The OP doesn't have to agree with all points made,
but doesn't get to ignore them and proceed as though they have never
been posted.

Daniel

[1] https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAP_GPNh2erpHzP0umxV_MuZRXKCkW_n8gJEGsM4aafqcKk02RQ%40mail.gmail.com%3E
[2] r1906817
[3] https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121092231.GA3174%40tarpaulin.shahaf.local2%3E
[4] https://skybrary.aero/articles/dec
[5] http://paulgraham.com/essay.html under the second and third headings
[6] https://xyproblem.info/
[7] the Business Judgment Rule

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 31 Jan 2023, Daniel Shahaf wrote:
>Karl Fogel wrote on Mon, 30 Jan 2023 23:26 +00:00:
>> Daniel, given what's in Evgeny's branch now, could you 
>> summarize 
>> your current technical objections if any?
>
>Certainly, but I won't have time to do so today.

Oh, my gosh, I'd be the last person to ever complain about someone 
not being prompt in sending a detailed technical reply here :-). 
It takes me *weeks* sometimes.  Whenever you get time is good.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Karl Fogel wrote on Mon, 30 Jan 2023 23:26 +00:00:
> Daniel, given what's in Evgeny's branch now, could you summarize 
> your current technical objections if any?

Certainly, but I won't have time to do so today.

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 29 Jan 2023, Evgeny Kotkov via dev wrote:
>I have *absolutely* no idea where "being railroaded through" 
>comes from.
>Really, it's a wrong way of portraying and thinking about the 
>events that have
>happened so far.
>
>Reiterating over those events: I wrote an email containing my 
>thoughts
>and explaining the motivation for such change.  I didn't reply to 
>some of
>the questions (including some tricky questions, such as the one 
>featuring
>a theoretical hash function), because they have been at least 
>partly
>answered by others in the thread, and I didn't have anything 
>valuable
>to add at that time.
>
>During that time, I was actively coding the core part of the 
>change,
>to check if it's possible technically.  Which is important, as 
>far as
>I believe, because not all theoretically possible solutions can 
>be implemented
>without facing significant practical or implementation-related 
>issues, and
>it seems to me that you significantly undervalue such an 
>approach.
>
>I do not say my actions were exemplary, but as far as I can tell, 
>they're
>pretty much in line with how svn-dev has been operating so far. 
>But, it all
>resulted in an unclear veto without any _technical_ arguments, 
>where what's
>being vetoed is unclear as well, because the change was not ready 
>at the
>moment veto got casted.
>
>And because your veto goes in favor of a specific process 
>(considering that
>no other arguments were given), the only thing that's *actually* 
>being
>railroaded is an odd form of an RTC (review-then-commit) process 
>that is
>against our usual CTR (commit-then-review) [1,2].  That's 
>railroading,
>because it hasn't been explicitly discussed anywhere and a 
>consensus
>on it has not been reached.

Daniel, given what's in Evgeny's branch now, could you summarize 
your current technical objections if any?

If they are something like "This code is solving the wrong 
problem(s)" or "I'm not sure what problem(s) it's supposed to 
solve", those count as technical objections.  It's just that it 
would be useful to have the objection(s) gathered in one place. 
This thread has been long and somewhat digressive -- I'm not 
saying that's due to you -- and I at least have found it a bit 
difficult to keep track of the concrete objections versus various 
interesting but ultimately theoretical points.

The reason I'm supportive of Evgeny's direction is that his 
changes, if completed, would offer a solution to the (admittedly 
still somewhat distant) security concern I raised early on. 
Essentially, I'm worried that second-preimage attacks on SHA-1 are 
coming eventually (maybe I'm wrong about this -- they are after 
all significantly harder than mere collision attacks).  *If* such 
attacks become possible, then our WC could report a file as 
unmodified when in fact it is modified, which would have real 
security implications, as I outlined.

Like I said, this is far from urgent, and IMHO it certainly should 
not delay a release of our new pristineless feature.  But when and 
if Evgeny's branch is ready (where "ready" presumably includes 
something other than salted SHA-1 as the other checksum option), I 
would like to see these changes go in, unless we identify some 
harm from them.

For everyone's ease of reference:

$ svn cat 
https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/BRANCH-README

$ svn log --stop-on-copy 
https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind/

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> > (I'm not saying that the above rules have to be used in this particular case
> >  and that a veto is invalid, but still thought it’s worth mentioning.)
> >
>
> I vetoed the change because it hadn't been designed on the dev@ list,
> had not garnered dev@'s consensus, and was being railroaded through.
> (as far as I could tell)

I have *absolutely* no idea where "being railroaded through" comes from.
Really, it's a wrong way of portraying and thinking about the events that have
happened so far.

Reiterating over those events: I wrote an email containing my thoughts
and explaining the motivation for such change.  I didn't reply to some of
the questions (including some tricky questions, such as the one featuring
a theoretical hash function), because they have been at least partly
answered by others in the thread, and I didn't have anything valuable
to add at that time.

During that time, I was actively coding the core part of the change,
to check if it's possible technically.  Which is important, as far as
I believe, because not all theoretically possible solutions can be implemented
without facing significant practical or implementation-related issues, and
it seems to me that you significantly undervalue such an approach.

I do not say my actions were exemplary, but as far as I can tell, they're
pretty much in line with how svn-dev has been operating so far.  But, it all
resulted in an unclear veto without any _technical_ arguments, where what's
being vetoed is unclear as well, because the change was not ready at the
moment veto got casted.

And because your veto goes in favor of a specific process (considering that
no other arguments were given), the only thing that's *actually* being
railroaded is an odd form of an RTC (review-then-commit) process that is
against our usual CTR (commit-then-review) [1,2].  That's railroading,
because it hasn't been explicitly discussed anywhere and a consensus
on it has not been reached.

[1] https://www.apache.org/foundation/glossary.html#CommitThenReview
[2] https://www.apache.org/foundation/glossary.html#ReviewThenCommit

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Mon, Jan 23, 2023 at 02:28:50 +0300:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > > I can complete the work on this branch and bring it to a production-ready
> > > state, assuming there are no objections.
> >
> > Your assumption is counterfactual:
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> 
> I don't see any explicit objections in these two emails (here I assume that
> if something is not clear to a PMC member, it doesn't automatically become
> an objection).  If the "why?" question is indeed an objection, then I would
> say it has already been discussed and responded to in the thread.
> 

The "Why?" was sent _after_ the post you're quoting, and in any case was
just an elevator pitch summary of something I had explained more verbosely.

The first post in this thread asserts X is a problem and Y is a solution
to it, and argues that Y is a good thing.  However, that post does not
explain /why/ X is a problem, does not consider alternatives to Y, and
does not consider possible cons of Y.  That's what's missing.

> Now, returning to the problem:
> 
> As described in the advisory [1], we have a supported configuration that
> makes data forgery possible:
> 
> - A repository with disabled rep-sharing allows storing different files with
>   colliding SHA-1 values.
> - Having a repository with disabled rep-sharing is a supported configuration.
>   There may be a certain number of such repositories in the wild
>   (for example, created with SVN < 1.6 and not upgraded afterwise).
> - A working copy uses an assumption that the pristine contents are equal if
>   their SHA-1 hashes are equal.
> - So committing different files with colliding SHA-1 values makes it possible
>   to forge the contents of a file that will be checked-out and used by the
>   client.
> 
> I would say that this state is worrying just by itself.
> 

I assume this situation could happen accidentally, say, if someone adds
shattered-1.pdf and shattered-2.pdf to the same wc in a particular way.
That is, I'm not assuming "forgery" (which implies Mallory is involved).

Still, this is a potential data integrity issue with the new-in-1.15 wc
format, so we should address it before the release.  What are our
options to address that?  Switching to another checksum is an option,
yes, but we [as in, dev@] don't seem to have considered any alternatives
to that.

Just off the top of my head, we could:

- Encourage or require use of rep-sharing
  [the advisory already recommends this]

- Encourage or require use of tools/hook-scripts/reject-detected-sha1-collisions.sh
  [the advisory already recommends this]

- Have f32 wc's refuse to talk to servers that don't detect SHA-1
  collisions.  (1.15 users will still be able to interoperate with old
  servers by using f31.)

And there may be more options.  (Lurkers are invited to speak up!)

> However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
> probably only a matter of time until the situation becomes worse.
> 

Quoting the third hunk of 
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C20221220201300.GH32332%40tarpaulin.shahaf.local2%3E>:

    What's the acceptance test we use for candidate checksum algorithms?

    You say we should switch to a checksum algorithm that doesn't have known
    collisions, but, why should we require that?  Consider the following
    160-bit checksum algorithm:
    .
        1. If the input consists of 40 ASCII lowercase hex digits and
           nothing else, return the input.
        2. Else, return the SHA-1 of the input.

    This algorithm has a trivial first preimage attack.  If a wc used this
    identity-then-sha1 algorithm instead of SHA-1, then… what?

> That could happen after a public disclosure of a pair of executable
> files/scripts where the forged version allows for remote code execution.
> Or maybe something similar with a file format that is often stored in
> repositories and that can be executed or used by a build script, etc.
> 

Err, hang on.  Your reference described a chosen-prefix attack, while
this scenario concerns a single public collision.  These are two
different things.

Disclosure of of a pair of executable files/scripts isn't by itself
a problem unless one of the pair ("file A") is in a repository
somewhere.  Now, was the colliding file ("file B") generated _before_ or
_after_ file A was committed?

- If _before_, then it would seem Mallory had somehow managed to:

  1. get a file of his choosing committed to Alice's repository; and

  2. get a wc of Alice's repository into one of the codepaths that
     assume SHA-1 is one-to-one / collission-free (currently that's the
     ra_serf optimization and the 1.15 wc status).

  Now, step #1 seems plausible enough.  As to step #2, it's not clear to
  me how file B would reach the wc in step #2… but insofar as security
  assumptions go, it seems reasonable to assume Mallory can make this
  happen.

  So, I agree it's a scenario we should address.  What options do we
  have to address it?  (I grant that migrating away from SHA-1 is one
  option.)

- If _after_, then you're presuming not simply a collision attack but
  a second preimage attack.  Should we assume Mallory to be able to
  mount a second preimage attack?

Chosen-prefix collision attacks can help Mallory in a variant of the
"before" case: Mallory computes a collision, sends file A to Alice (who
commits it), and invokes his assumed ability to inject file B into
Alice's wc.  This would work for file formats that ignore the unchosen
suffix.

> [1] https://subversion.apache.org/security/sha1-advisory.txt
> [2] https://sha-mbles.github.io/
> 
> 
> Speaking of the proposed switch to SHA-256 or a different checksum, there's
> an argument by contradiction: if we were designing the pristineless working
> copy from scratch today, would we choose SHA-1 as the best available hash
> that can be used to assert content equality?

If we were designing f32 from the ground up, I hope we'd first nail down
our requirements and then check what are the possible ways to address
them.

We might specify that "The probability of birthday collisions in
<USE-CASE> must not exceed <PERCENTAGE>.".

[E.g., my parents named each of their kids for the CRC32 of the
timestamp,longitude,latitude,altitude of that kid's birth, and that was
fine: they had no collisions.  In contrast, sea turtles shouldn't use
CRC32 to name their kids, since they'd have a ≈5% chance of a collision
due to their larger number of offspring.  However, sea turtles would
have no collisions if they used MD5.]

We might specify that "The hash of an <N>-byte file can be computed
within <T> milliseconds on <SUCH AND SUCH> hardware."

[E.g., the existence of https://en.wikipedia.org/wiki/Intel_SHA_extensions
is a consideration.]

We might specify that "An attacker who is capable of <SUCH AND SUCH>
will not be able to cause a false positive or a false negative in the wc
status optimization.".

[E.g., see above about second preimage attacks.]

And then we'd brainstorm possible solutions (plural) and run each of
them through the specifications, which would be our acceptance test
checklist.

(And since we aren't designing from scratch, our actual acceptance test
would also include implementation and maintenance costs for us and
upgrade costs for our users.)

> If yes, how can one prove that?

Well, for starters, rep-sharing was released in in 2009, the first
public collision (shattered) was published in 2017, a chosen-prefix
attack (shambled) in 2020, and we haven't had any complaints since then
<fine print>other than from people literally trying to store
shattered-1.pdf and shattered-2.pdf in their repos</fine print>?

And so long as we're doing thought experiments, here's another: If we
switched to using only MD5 internally, would anyone notice?  (Cf. above
about identity-then-sha1, which is even weaker than MD5.)

> > Objections have been raised, been left unanswered, and now implementation
> > work has commenced following the original design.  That's not acceptable.
> > I'm vetoing the change until a non-rubber-stamp design discussion has
> > been completed on the public dev@ list.
> 
> I would like to note that vetoing a code modification should be accompanied
> with a technical justification, and I have certain doubts that the above
> arguments qualify as such:
> 
> https://www.apache.org/foundation/voting.html
> [[[
> To prevent vetoes from being used capriciously, the voter must provide
> with the veto a technical justification showing why the change is bad
> (opens a security exposure, negatively affects performance, etc. ).
> A veto without a justification is invalid and has no weight.
> ]]]
> 
> (I'm not saying that the above rules have to be used in this particular case
>  and that a veto is invalid, but still thought it’s worth mentioning.)
> 

I vetoed the change because it hadn't been designed on the dev@ list,
had not garnered dev@'s consensus, and was being railroaded through.
(as far as I could tell)

> Anyway, I'll stop working on the branch, because a veto has been casted.

That's your decision.  Implementing one design on a branch while other
options are being considered by dev@ /is/ possible, but there are some
risks with that; cf. my remarks in
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121092231.GA3174%40tarpaulin.shahaf.local2%3E>.

And once again, for clarity: I'm not vetoing migrating away from SHA-1.
(In fact, my intuition was that it'd be a good idea.)

Daniel

> 
> Regards,
> Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Daniel Shahaf <d....@daniel.shahaf.name> writes:

> > I can complete the work on this branch and bring it to a production-ready
> > state, assuming there are no objections.
>
> Your assumption is counterfactual:
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E

I don't see any explicit objections in these two emails (here I assume that
if something is not clear to a PMC member, it doesn't automatically become
an objection).  If the "why?" question is indeed an objection, then I would
say it has already been discussed and responded to in the thread.

Now, returning to the problem:

As described in the advisory [1], we have a supported configuration that
makes data forgery possible:

- A repository with disabled rep-sharing allows storing different files with
  colliding SHA-1 values.
- Having a repository with disabled rep-sharing is a supported configuration.
  There may be a certain number of such repositories in the wild
  (for example, created with SVN < 1.6 and not upgraded afterwise).
- A working copy uses an assumption that the pristine contents are equal if
  their SHA-1 hashes are equal.
- So committing different files with colliding SHA-1 values makes it possible
  to forge the contents of a file that will be checked-out and used by the
  client.

I would say that this state is worrying just by itself.

However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
probably only a matter of time until the situation becomes worse.

That could happen after a public disclosure of a pair of executable
files/scripts where the forged version allows for remote code execution.
Or maybe something similar with a file format that is often stored in
repositories and that can be executed or used by a build script, etc.

[1] https://subversion.apache.org/security/sha1-advisory.txt
[2] https://sha-mbles.github.io/

Speaking of the proposed switch to SHA-256 or a different checksum, there's
an argument by contradiction: if we were designing the pristineless working
copy from scratch today, would we choose SHA-1 as the best available hash
that can be used to assert content equality?  If yes, how can one prove that?

> Objections have been raised, been left unanswered, and now implementation
> work has commenced following the original design.  That's not acceptable.
> I'm vetoing the change until a non-rubber-stamp design discussion has
> been completed on the public dev@ list.

I would like to note that vetoing a code modification should be accompanied
with a technical justification, and I have certain doubts that the above
arguments qualify as such:

https://www.apache.org/foundation/voting.html
[[[
To prevent vetoes from being used capriciously, the voter must provide
with the veto a technical justification showing why the change is bad
(opens a security exposure, negatively affects performance, etc. ).
A veto without a justification is invalid and has no weight.
]]]

(I'm not saying that the above rules have to be used in this particular case
 and that a veto is invalid, but still thought it’s worth mentioning.)

Anyway, I'll stop working on the branch, because a veto has been casted.

Regards,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Jan 2023, Daniel Shahaf wrote:
>Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
>> I can complete the work on this branch and bring it to a 
>> production-ready
>> state, assuming there are no objections.
>
>Your assumption is counterfactual:
>
>https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
>https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>
>Objections have been raised, been left unanswered, and now
>implementation work has commenced following the original design. 
>That's
>not acceptable.

I'm a little surprised by your reaction.

It is never "not acceptable" for someone to do implementation work 
on a branch while a discussion is happening, even if that 
discussion contains objections to or questions about the premise 
of the branch work.

It's a branch.  He didn't merge it to trunk, and he posted it as 
an explicit invitation for discussion.

>I'm vetoing the change until a non-rubber-stamp design
>discussion has been completed on the public dev@ list.

Starting an implementation on a branch is a valuable contribution 
to a design discussion -- it's exactly the kind of 
"non-rubber-stamp" contribution one would want.

If you want to re-iterate points you've made that have been left 
unanswered, that would be a useful contribution -- perhaps some of 
those points will be updated now that there's actual code, or 
perhaps they won't.  Either way, what Evgeny is doing here seems 
very constructive to me, and entirely within the normal range of 
how we do things.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Fri, Jan 20, 2023 at 9:51 AM Nathan Hartman <ha...@gmail.com> wrote:
>
> On Fri, Jan 20, 2023 at 7:18 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> >
> > Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> > > I can complete the work on this branch and bring it to a production-ready
> > > state, assuming there are no objections.
> >
> > Your assumption is counterfactual:
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> >
> > Objections have been raised, been left unanswered, and now
> > implementation work has commenced following the original design.  That's
> > not acceptable.  I'm vetoing the change until a non-rubber-stamp design
> > discussion has been completed on the public dev@ list.
>
>
> I think we can start by discussing some of the pros and cons.
>
> There are two separate things here but they end up being mixed
> together in the discussions:
>
> 1. Pros/cons of switching from SHA1 to another hash.
> 2. Supporting different hash types in f32.
>
> Regarding the first item:
>
> Do we need to switch from SHA1 to another hash? One con that was
> already mentioned [1] is that we'll never really be able to switch
> away from SHA1, as there are existing clients, servers, and working
> copies out there. Not only will we have to support SHA1 forever for
> backwards compatibility, but any new hash that is ever added will need
> to be supported forever as well. If we accumulate many of those, it
> might become a burden, but perhaps there will be only one new hash and
> it will be the "blessed" one for the next 20 years.
>
> There were concerns about collisions; since the space of possible
> input datasets is infinite and the hash code size is fixed and finite
> (pretty large, but very much finite), there will always be collisions
> with any hash. The significant questions are: how small is the
> probability of a collision, and (for the purposes of security) how
> hard is it to generate input data that produces a collision? The
> answer to the first question is fixed; the second one is probably
> expected to change over time, as algorithms are studied and new
> vulnerabilities are found. Which hash type do you pick, and who knows
> if a hash thought to be very strong (today) later proves easier to
> crack than one that is thought not as strong? We can only guess.
>
> Taking a step back, this discussion started because pristine-free WCs
> are IIUC more dependent on comparing hashes than pristineful WCs, and
> therefore a hash collision could have more impact in a pristine-free
> WC. "Guarantees" were mentioned, but I think it's important to state
> that there's only a guarantee of probability, since as mentioned above
> all hashes will have collisions.
>
> We already can't store files with identical SHA1 hashes, but AFAIK the
> only meaningful impact we've ever heard is that security researchers
> cannot track files they generate with deliberate collisions. The same
> would be true with any hash type, for collisions within that hash
> type.
>
> Advantages of switching to a new hash type might include: reducing the
> already small probability of collisions; choosing an algorithm that is
> faster or that has (or is expected to have in the future) hardware
> acceleration on commodity systems, perhaps addressing user perception
> (if SHA1 is seen as old and uncool), but then again, we can't really
> get rid of SHA1...
>
> [1] https://lists.apache.org/thread/v3dv1dtod2t9yrf920h4838g2t0l94cw
>
> Regarding the second item:
>
> Since the premise of this feature is to support adding new hash types
> without bumping wc formats, it follows that any new hash type will
> create compatibility problems for clients that support f32 but not the
> specific new hash type. In light of that, it might just be better to
> bump the wc format and then you know at the outset that you need to
> upgrade your client. Just thinking out loud here but this might be
> (partly) mitigated by trying to guess which hash types we might want
> in the future and supporting them now, even if no existing client will
> actually use them, but I don't really like this idea.
>
> I'll have to return later with more thoughts...

Just quickly I want to say that although I mentioned mostly cons
above, I don't want to appear to be against switching hashes nor
against supporting multiple hash types in f32; rather, since the
i525-pod feature necessitated a format bump anyway, I do think it
makes sense to consider adding such changes now, to avoid a future
format bump, and I'm considering arguments contrary to that from a
desire to be unbiased about it.

I have more thoughts (including more pros) but have some things to
attend to now.

Looking forward to hearing others' thoughts as well.

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Nathan Hartman wrote on Fri, 20 Jan 2023 14:51 +00:00:
> 1. Pros/cons of switching from SHA1 to another hash.
⋮
> Do we need to switch from SHA1 to another hash? One con that was
> already mentioned [1] is that we'll never really be able to switch
> away from SHA1, as there are existing clients, servers, and working
> copies out there. Not only will we have to support SHA1 forever for
> backwards compatibility,

Actually, I think it's MD5, not SHA-1, that we have to support
indefinitely, since our uses of SHA-1 fall into two categories:

- Accompanied by MD5.  (wc.db PRISTINE table, FSFS node-rev headers,
  dumpfiles' Text-content-* headers)

- An optional optimization.  (ra_serf, rep-cache.db)

>                          but any new hash that is ever added will need
> to be supported forever as well. If we accumulate many of those, it
> might become a burden,

Good point.  Then perhaps we should continue to record two checksums, as
both wc.db and FSFS do?  If we record, say, both «(svn_checksum_kind_t)42»
checksums and «(svn_checksum_kind_t)value_of_the_month» checksums, then
we'll only need to be able to upgrade from the former.

>                        but perhaps there will be only one new hash and
> it will be the "blessed" one for the next 20 years.

Cheers,

Daniel

P.S.  wc-metadata.sql implies that having MD5 collisions in a wc is supported:

     1	/* wc-metadata.sql -- schema used in the wc-metadata SQLite database
     2	 *     This is intended for use with SQLite 3
     ⋮
    94	CREATE TABLE PRISTINE (
    95	  /* The SHA-1 checksum of the pristine text. This is a unique key. The
    96	     SHA-1 checksum of a pristine text is assumed to be unique among all
    97	     pristine texts referenced from this database. */
    98	  checksum  TEXT NOT NULL PRIMARY KEY,
    99	
     ⋮
   114	  /* Alternative MD5 checksum used for communicating with older
   115	     repositories. Not strictly guaranteed to be unique among table rows. */
   116	  md5_checksum  TEXT NOT NULL
   117	  );
   118	
   119	CREATE INDEX I_PRISTINE_MD5 ON PRISTINE (md5_checksum);

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

[ tl;dr: See last paragraph for a concrete question about ra_serf. ]

Karl Fogel wrote on Fri, 20 Jan 2023 17:18 +00:00:
> Yes.  A hash is considered "broken" the moment security researches 
> can generate a collision.

Consider the following uses of hash functions in our code:

- FSFS rep-cache uses SHA-1.

- The ra_serf download optimization uses SHA-1.

- The commit editor uses MD5 in apply_textdelta() and close_file().

The first one is fine, because FSFS rejects collisions in new commits
(as pointed out upthread).

The second one is not necessarily fine: a variation of the attack you (kfogel)
described could make a client wrongly trigger the optimization and end
up with the wrong fulltext.

The third one is fine, because the delta and its resulting fulltext's
checksum don't travel separately.

So, there you have it: a use of SHA-1 which can stay as-is, a use of SHA-1
which may need attention, and a use of MD5 which can stay as-is — all
in the same codebase.

Thus, whether a hash function is "broken" or not depends on the context
in which it is used.

----

To be clear, the ra_serf thing which "may need attention" is the use
of «final_sha1_checksum» in subversion/libsvn_ra_serf/update.c.  That's
a place where we assume SHA-1 is one-to-one.

Cheers,

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

Replying to multiple parts of this thread...

On Sat, Jan 21, 2023 at 12:58 PM Karl Fogel <kf...@red-bean.com> wrote:
> *nod* This issue isn't important enough to me to continue the
> conversation -- I'd like for new hash algorithms to be possible,
> and I think Evgeny's work on it is worthwhile, but I don't feel
> nearly as strongly about this as I feel about making the new
> pristineless working copies available in an official release as
> soon as we can.

I think it's certainly worthwhile to explore the multi-hash feature,
and if it can be in 1.15, that's good too. But if it will take a while
to *hash* out the details (pun intended) then I'm okay with letting it
wait for a future release in the interest of getting the i525pod
feature out there, even though that means a (possible) future format
bump. (i525pod provides a substantial immediate benefit, while a format
bump isn't necessarily the end of the world).

Having said so, continuing to explore the multi-hash idea:

Previously, I wrote: "Since the premise of this feature is to support
adding new hash types without bumping wc formats, it follows that any
new hash type will create compatibility problems for clients that
support f32 but not the specific new hash type. In light of that, it
might just be better to bump the wc format and then you know at the
outset that you need to upgrade your client. Just thinking out loud
here but this might be (partly) mitigated by trying to guess which hash
types we might want in the future and supporting them now, even if no
existing client will actually use them, but I don't really like this
idea."

I didn't like my own idea at the time, but the following got me
thinking:

On Sun, Jan 22, 2023 at 7:41 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> The server is aware of what algorithm the wc uses on the wire, which is
> SHA-1 in ra_serf's download optimization and MD5 in svn_delta_editor_t::apply_textdelta()
> and svn_delta_editor_t::close_file().  However, the algorithm(s) used by
> the wc for naming pristines and, in f32, for detecting local mods are
> implementation details of the wc.
>
> So, suppose the wc didn't hardcode _any particular_ hash function for
> naming pristines and for status walks — not md5, not sha1, not sha256 —
> but had each «svn checkout» run pick a hash function uniformly at random
> out of a large enough family of hash functions[1].  (Intuitively, think
> of a family of hash functions as a hash function with a random salt,
> similar to [2].)
>
> This way, even if someone tried to deliberately create a collision, they
> wouldn't be able to pick a collision "off the shelf", as with
> shattered.io; they'd need to compute a collision for the specific hash
> function ("salt") used by that particular wc.  That's more difficult than
> creating a collision in a well-known hash function, regardless of
> whether we treat the salt's value as a secret of the wc (as in, stored
> in a mode-0400 file in under .svn directory and not disclosed to the
> server) or as a value the attacker is assumed to know.
>
> So, that's one way to address the scenario kfogel described.

Suppose the wc is made to support multiple hash types, support is added
now for "many" hash types (leaving open the question of "how many" and
which ones for now), and at checkout time, one is chosen, either "at
random" as suggested by danielsh, or, say, by some explicit user
option.

Suppose also that there is a possibility for the user to blacklist some
hash types which the user does not want used at all.

Now, if a specific hash type is later cracked (in the shattered.io
sense), the security fix on SVN's end is to add that hash type to the
default blacklist of hash types. It would still be supported, but new
working copies wouldn't choose it. In the advisory for said fix, we'd
document a workaround for users who can't/won't upgrade: the steps
users can take to blacklist the affected hash types on their systems,
in effect getting the same outcome as upgrading.

One caveat: In either case (whether the user upgrades or applies the
workaround), they'd have to check out new working copies (or maybe run
some invocation of 'svn upgrade') or the existing hashes won't be
changed.

And there's also this:

On Sat, Jan 21, 2023 at 5:25 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> For example, if we used another checksum algorithm, the attacker from
> your scenario might opt to edit the base checksums in .svn/wc.db and
> rename the .svn/pristine/ files accordingly.  That's much easier to pull
> off, and will be easy to adapt if we change the algorithm again, but on
> the other hand, requires write access to the .svn directory and is
> easier to discover.

Yup. Once an attacker has write access to the .svn contents, all bets
are off anyway.

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

[See below a proposal that libsvn_wc not use any fixed hash function.]

Martin Edgar Furter Rathod wrote on Sat, 21 Jan 2023 05:22 +00:00:
> On 20.01.23 22:48, Karl Fogel wrote:
>> On 20 Jan 2023, Nathan Hartman wrote:
>>> We already can't store files with identical SHA1 hashes, but AFAIK the
>>> only meaningful impact we've ever heard is that security researchers
>>> cannot track files they generate with deliberate collisions. The same
>>> would be true with any hash type, for collisions within that hash
>>> type.
>> 
>> Yes.  A hash is considered "broken" the moment security researches can 
>> generate a collision.
>
> No matter what hash function you choose now, sooner or later it will be 
> broken.
>
> But a broken hash function can still be good enough for use in tools 
> like subversion if it is used correctly. Instead of just storing the 
> hash value subversion should also store a sequence number. Whenever a 
> collision happens subversion has to compare the two (or more) files 
> which have the same hash value.

So, basically, just do what the implementation of hashes (the data
structure mapping keys to values) does?

I think this would work in most of our uses of checksums, and make it
possible to have collisions in both the repository and the wc.

However, what about running `svn status` when there's an unhydrated file
that has been modified in a way that changes the fulltext but doesn't
change the checksum value?  In this case the BASE fulltext isn't
available locally to compare with.

----

I think there is actually something we can do about this: stop
hardcoding any particular hash function in libsvn_wc's internals.

The server is aware of what algorithm the wc uses on the wire, which is
SHA-1 in ra_serf's download optimization and MD5 in svn_delta_editor_t::apply_textdelta()
and svn_delta_editor_t::close_file().  However, the algorithm(s) used by
the wc for naming pristines and, in f32, for detecting local mods are
implementation details of the wc.

So, suppose the wc didn't hardcode _any particular_ hash function for
naming pristines and for status walks — not md5, not sha1, not sha256 —
but had each «svn checkout» run pick a hash function uniformly at random
out of a large enough family of hash functions[1].  (Intuitively, think
of a family of hash functions as a hash function with a random salt,
similar to [2].)

This way, even if someone tried to deliberately create a collision, they
wouldn't be able to pick a collision "off the shelf", as with
shattered.io; they'd need to compute a collision for the specific hash
function ("salt") used by that particular wc.  That's more difficult than
creating a collision in a well-known hash function, regardless of
whether we treat the salt's value as a secret of the wc (as in, stored
in a mode-0400 file in under .svn directory and not disclosed to the
server) or as a value the attacker is assumed to know.

So, that's one way to address the scenario kfogel described.

Thanks for speaking up, Martin.

Daniel

[1] I'm not making this term up; see, for instance, page 143 of
    https://cseweb.ucsd.edu/~mihir/papers/gb.pdf.  "𝒦" is keyspace,
    "D" is domain, "R" is range.  A random element K ∈ 𝒦 is chosen and the
    hash function H_K [aka H with currying of the first parameter] is
    used thereafter.

[2]
    def f(foo):
        return sha1(str(foo) + f.salt)
    f.salt = str(random_thing())

> If the files are identical the old 
> hash+number pair is stored. If they differ the new file gets a new 
> sequence number and that hash+number pair is stored. Since collisions 
> almost never happen even if md5 is used the performance penalty will be 
> almost zero.
>
> The same thing has been discussed earlier and changing the hash function 
> will just solve the problem for a few years...
>
> Best regards,
> Martin

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Martin Edgar Furter Rathod <mf...@apache.org>.

On 20.01.23 22:48, Karl Fogel wrote:
> On 20 Jan 2023, Nathan Hartman wrote:
>> We already can't store files with identical SHA1 hashes, but AFAIK the
>> only meaningful impact we've ever heard is that security researchers
>> cannot track files they generate with deliberate collisions. The same
>> would be true with any hash type, for collisions within that hash
>> type.
> 
> Yes.  A hash is considered "broken" the moment security researches can 
> generate a collision.

No matter what hash function you choose now, sooner or later it will be 
broken.

But a broken hash function can still be good enough for use in tools 
like subversion if it is used correctly. Instead of just storing the 
hash value subversion should also store a sequence number. Whenever a 
collision happens subversion has to compare the two (or more) files 
which have the same hash value. If the files are identical the old 
hash+number pair is stored. If they differ the new file gets a new 
sequence number and that hash+number pair is stored. Since collisions 
almost never happen even if md5 is used the performance penalty will be 
almost zero.

The same thing has been discussed earlier and changing the hash function 
will just solve the problem for a few years...

Best regards,
Martin

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Jan 2023, Nathan Hartman wrote:
>Taking a step back, this discussion started because pristine-free 
>WCs
>are IIUC more dependent on comparing hashes than pristineful WCs, 
>and
>therefore a hash collision could have more impact in a 
>pristine-free
>WC. "Guarantees" were mentioned, but I think it's important to 
>state
>that there's only a guarantee of probability, since as mentioned 
>above
>all hashes will have collisions.

Sure, in a literal mathematical sense, but not in a sense that 
matters for our purposes here.

In the absence of an intentionally caused collision, a good hash 
function has *far* less chance of accidental collision than, say, 
the chance that your CPU will malfunction due to a stray cosmic 
ray, or the chance of us getting hit by a planet-destroying 
meteorite tomorrow.

For our purposes, "guarantee" is accurate.  No guarantee we make 
can be stonger than the inverse probability of a CPU/memory 
malfunction anyway.

>We already can't store files with identical SHA1 hashes, but 
>AFAIK the
>only meaningful impact we've ever heard is that security 
>researchers
>cannot track files they generate with deliberate collisions. The 
>same
>would be true with any hash type, for collisions within that hash
>type.

Yes.  A hash is considered "broken" the moment security researches 
can generate a collision.

FWIW, in one of my previous posts, I described a real-life 
scenario in which the ability to generate a chosen-plaintext 
collision in an SVN working copy would have security implications.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Nathan Hartman <ha...@gmail.com>.

On Fri, Jan 20, 2023 at 7:18 AM Daniel Shahaf <d....@daniel.shahaf.name> wrote:
>
> Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> > I can complete the work on this branch and bring it to a production-ready
> > state, assuming there are no objections.
>
> Your assumption is counterfactual:
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
>
> Objections have been raised, been left unanswered, and now
> implementation work has commenced following the original design.  That's
> not acceptable.  I'm vetoing the change until a non-rubber-stamp design
> discussion has been completed on the public dev@ list.

I think we can start by discussing some of the pros and cons.

There are two separate things here but they end up being mixed
together in the discussions:

1. Pros/cons of switching from SHA1 to another hash.
2. Supporting different hash types in f32.

Regarding the first item:

Do we need to switch from SHA1 to another hash? One con that was
already mentioned [1] is that we'll never really be able to switch
away from SHA1, as there are existing clients, servers, and working
copies out there. Not only will we have to support SHA1 forever for
backwards compatibility, but any new hash that is ever added will need
to be supported forever as well. If we accumulate many of those, it
might become a burden, but perhaps there will be only one new hash and
it will be the "blessed" one for the next 20 years.

There were concerns about collisions; since the space of possible
input datasets is infinite and the hash code size is fixed and finite
(pretty large, but very much finite), there will always be collisions
with any hash. The significant questions are: how small is the
probability of a collision, and (for the purposes of security) how
hard is it to generate input data that produces a collision? The
answer to the first question is fixed; the second one is probably
expected to change over time, as algorithms are studied and new
vulnerabilities are found. Which hash type do you pick, and who knows
if a hash thought to be very strong (today) later proves easier to
crack than one that is thought not as strong? We can only guess.

Taking a step back, this discussion started because pristine-free WCs
are IIUC more dependent on comparing hashes than pristineful WCs, and
therefore a hash collision could have more impact in a pristine-free
WC. "Guarantees" were mentioned, but I think it's important to state
that there's only a guarantee of probability, since as mentioned above
all hashes will have collisions.

We already can't store files with identical SHA1 hashes, but AFAIK the
only meaningful impact we've ever heard is that security researchers
cannot track files they generate with deliberate collisions. The same
would be true with any hash type, for collisions within that hash
type.

Advantages of switching to a new hash type might include: reducing the
already small probability of collisions; choosing an algorithm that is
faster or that has (or is expected to have in the future) hardware
acceleration on commodity systems, perhaps addressing user perception
(if SHA1 is seen as old and uncool), but then again, we can't really
get rid of SHA1...

[1] https://lists.apache.org/thread/v3dv1dtod2t9yrf920h4838g2t0l94cw

Regarding the second item:

Since the premise of this feature is to support adding new hash types
without bumping wc formats, it follows that any new hash type will
create compatibility problems for clients that support f32 but not the
specific new hash type. In light of that, it might just be better to
bump the wc format and then you know at the outset that you need to
upgrade your client. Just thinking out loud here but this might be
(partly) mitigated by trying to guess which hash types we might want
in the future and supporting them now, even if no existing client will
actually use them, but I don't really like this idea.

I'll have to return later with more thoughts...

Cheers,
Nathan

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Evgeny Kotkov via dev wrote on Thu, 19 Jan 2023 18:52 +00:00:
> I can complete the work on this branch and bring it to a production-ready
> state, assuming there are no objections.

Your assumption is counterfactual:

https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E

https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E

Objections have been raised, been left unanswered, and now
implementation work has commenced following the original design.  That's
not acceptable.  I'm vetoing the change until a non-rubber-stamp design
discussion has been completed on the public dev@ list.

Daniel

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 19 Jan 2023, Evgeny Kotkov wrote:
>To have a more or less accurate estimate, I went ahead and 
>prepared the
>first-cut implementation of an approach that makes the pristine 
>checksum
>kind configurable in a working copy.
>
>The current implementation passes all tests in my environment and 
>seems to
>work in practice.  It is available on the branch:
>
>  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind
>
>The implementation on the branch allows creating working copies 
>that use a
>checksum kind other than SHA-1.
>
>The checksum kind is persisted in the settings table.  Upgraded 
>working copies
>of the older formats will have SHA-1 recorded as their pristine 
>checksum kind
>and will continue to use it for compatibility.  Newly created 
>working copies
>of the latest format (with --compatible-version=1.15 or 
>--store-pristine=no),
>as currently implemented, will use the new pristine checksum 
>kind.
>
>Currently, as a proof-of-concept, the branch uses salted SHA-1 as 
>the new
>pristine checksum kind.  For the production-ready state, I plan 
>to support
>using multiple new checksum types such as SHA-256.  I think that 
>it would
>be useful for future compatibility, because if we encounter any 
>issues with
>one checksum kind, we could then switch to a different kind 
>without having
>to change the working copy format.
>
>One thing worth noting is that ra_serf contains a specific 
>optimization for
>the skelta-style updates that allows skipping a GET request if 
>the pristine
>store already contains an entry with the specified SHA-1 
>checksum.  Switching
>to a different checksum type for the pristine entries is going to 
>disable
>that specific optimization.  Re-enabling it would require an 
>update of the
>server-side.  I consider this to be out of scope for this branch.
>
>I can complete the work on this branch and bring it to a 
>production-ready
>state, assuming there are no objections.

This sounds great to me; thank you, Evgeny.  I agree that the 
server-side companion change is (or anyway can be) out-of-scope 
here -- the perfect should not be the enemy of the good, etc.

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> Now, how hard would this be to actually implement?

To have a more or less accurate estimate, I went ahead and prepared the
first-cut implementation of an approach that makes the pristine checksum
kind configurable in a working copy.

The current implementation passes all tests in my environment and seems to
work in practice.  It is available on the branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind

The implementation on the branch allows creating working copies that use a
checksum kind other than SHA-1.

The checksum kind is persisted in the settings table.  Upgraded working copies
of the older formats will have SHA-1 recorded as their pristine checksum kind
and will continue to use it for compatibility.  Newly created working copies
of the latest format (with --compatible-version=1.15 or --store-pristine=no),
as currently implemented, will use the new pristine checksum kind.

Currently, as a proof-of-concept, the branch uses salted SHA-1 as the new
pristine checksum kind.  For the production-ready state, I plan to support
using multiple new checksum types such as SHA-256.  I think that it would
be useful for future compatibility, because if we encounter any issues with
one checksum kind, we could then switch to a different kind without having
to change the working copy format.

One thing worth noting is that ra_serf contains a specific optimization for
the skelta-style updates that allows skipping a GET request if the pristine
store already contains an entry with the specified SHA-1 checksum.  Switching
to a different checksum type for the pristine entries is going to disable
that specific optimization.  Re-enabling it would require an update of the
server-side.  I consider this to be out of scope for this branch.

I can complete the work on this branch and bring it to a production-ready
state, assuming there are no objections.


Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 29 Dec 2022, Evgeny Kotkov wrote:
>Karl Fogel <kf...@red-bean.com> writes:
>
>> Now, how hard would this be to actually implement?
>
>I plan to take a more detailed look at that, but I'm currently on 
>vacation
>for the New Year holidays.

That's great to hear, Evgeny.  In the meantime, enjoy your 
vacation!

Best regards,
-Karl

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> Now, how hard would this be to actually implement?

I plan to take a more detailed look at that, but I'm currently on vacation
for the New Year holidays.


Thanks,
Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 28 Dec 2022, Branko Čibej wrote:
>My point was that we shouldn't have to worry about format bumps 
>as
>much any more because we have infrastructure in the client for
>supporting multiple WC formats. That includes optional pristines,
>different hashes, compressed pristines, etc. etc.

Thank you for the reminder -- that is indeed important here.

On 28 Dec 2022, Daniel Sahlberg wrote:
>Since we need to be backwards compatible with older v1 clients, 
>can
>this check ever be removed (before Subversion 2)?
>
>So, while I believe f32 is a good opportunity to switch to a new
>hash, what is the problem we would like to solve with a new hash?

As I said before, even if we couldn't think of a concrete problem 
right now, the mere fact that a former guarantee [1] has become a 
non-guarantee is enough motivation.  We can't anticipate all the 
problems that might arise from people being able to craft local 
content that looks unmodified to Subversion.  (As you implied, 
r1794611 has no effect for content that is never committed to the 
repository.)

Of course, my saying "This matters just through reasoning from 
first principles, therefore we should fix it" would count for a 
lot more if I were volunteering to fix it, which I'm not alas. 
But I do think we don't need to search further for justifications. 
What we already know is enough: our hash algorithm is known to be 
collidable, yet what we're using it for depends on 
non-collidability; therefore, switching to a better algorithm is a 
good idea.

However, it needn't be a blocker for the next release, for the 
reason Brane gave.

Best regards,
-Karl

[1] "Former guarantee" meaning "former guarantee for all practical 
purposes", of course, since in the past there weren't ways to make 
collisions happen.

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Branko Čibej <br...@apache.org>.

On 28.12.2022 13:34, Daniel Sahlberg wrote:
> Since we need to be backwards compatible with older v1 clients, can 
> this check ever be removed (before Subversion 2)?

The case you're citing is specific to the repository, you could easily 
have a repository format that uses different hashes. The same for the RA 
layer, where we have capability negotiation; likewise for the WC. We'll 
always need compatibility with older formats, but a new enough client 
and server could use, e.g., SHA-256 or -512 all the way from WC to 
repository.

> So, while I believe f32 is a good opportunity to switch to a new hash, 
> what is the problem we would like to solve with a new hash?

On the other hand, there can be no "switching to" a new hash, because 
you don't know what the server actually supports -- hence, we'll always 
have to keep SHA-1 around. :) IMO Karl described one possible attack 
vector, and given the context (Wordpress...) it's probably only a matter 
of time before it happens.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Daniel Sahlberg <da...@gmail.com>.

Den ons 28 dec. 2022 kl 08:48 skrev Branko Čibej <br...@apache.org>:

> On 27.12.2022 02:56, Karl Fogel wrote:
>
> Now, how hard would this be to actually implement?  The
> pristineless-format WC upgrade is an opportunity to make other format
> changes, but I'd hate to block the release of pristineless working copies
> on this...
>
>
> My point was that we shouldn't have to worry about format bumps as much
> any more because we have infrastructure in the client for supporting
> multiple WC formats. That includes optional pristines, different hashes,
> compressed pristines, etc. etc.
>

Evgeny has a point that when going from 31 to 32, we know that all
pristines are there and we can rehash them in place. If/when we create
format X with the new XYZ-hash, we either have to download all missing
pristines or we have to support multiple hashes for each file.

I've been thinking about this question and while I don't know all
background, it seems to be two different questions:
- Detecting changes in the WC. Karl has an excellent scenario where this
might be a problem, but switching to a new hash only makes this scenario
more expensive. Thus: What is the definition of "expensive enough"? I
believe this is a different way of asking the same question posed by
DanielSh about the criteria for a new hash.
- Storing files with hash collisions. Subversion prevents this (with
E160067) and as far as I understand this is because of r1794611 (by Stefan
Sperling) and the log message argues:

[[[
However, similar problems still exist in (at least) the RA layer and the
working copy. Until those are fixed, rejecting content which causes a hash
collision is the safest approach and avoids the undesired consequences of
storing such content.
]]]

Since we need to be backwards compatible with older v1 clients, can this
check ever be removed (before Subversion 2)?

So, while I believe f32 is a good opportunity to switch to a new hash, what
is the problem we would like to solve with a new hash?

Kind regards,
Daniel Sahlberg

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Branko Čibej <br...@apache.org>.

On 27.12.2022 02:56, Karl Fogel wrote:
> Now, how hard would this be to actually implement?  The 
> pristineless-format WC upgrade is an opportunity to make other format 
> changes, but I'd hate to block the release of pristineless working 
> copies on this...

My point was that we shouldn't have to worry about format bumps as much 
any more because we have infrastructure in the client for supporting 
multiple WC formats. That includes optional pristines, different hashes, 
compressed pristines, etc. etc.

-- Brane

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Posted by Karl Fogel <kf...@red-bean.com>.

On 20 Dec 2022, Evgeny Kotkov via dev wrote:
>[Moving discussion to a new thread]
>
>We currently have a problem that a working copy relies on the 
>checksum type
>with known collisions (SHA1).  A solution to that problem is to 
>switch to a
>different checksum type without known collisions in one of the 
>newer working
>copy formats.
>
>Since we plan on shipping a new working copy format in 1.15, this 
>seems to
>be an appropriate moment of time to decide whether we'd also want 
>to switch
>to a checksum type without known collisions in that new format.
>
>Below are the arguments for including a switch to a different 
>checksum type
>in the working copy format for 1.15:
>
>1) Since the "is the file modified?" check now compares 
>checksums, leaving
>   everything as-is may be considered a regression, because it 
>   would
>   introduce additional cases where a working copy currently 
>   relies on
>   comparing checksums with known collisions.
>
>2) We already need a working copy format bump for the 
>pristines-on-demand
>   feature.  So using that format bump to solve the SHA1 issue 
>   might reduce
>   the overall number of required bumps for users (assuming that 
>   we'll still
>   need to switch from SHA1 at some point later).
>
>3) While the pristines-on-demand feature is not released, 
>upgrading with a
>   switch to the new checksum type seems to be possible without 
>   requiring a
>   network fetch.  But if some of the pristines are optional, we 
>   lose the
>   possibility to rehash all contents in place.  So we might find 
>   ourselves
>   having to choose between two worse alternatives of either 
>   requiring a
>   network fetch during upgrade or entirely prohibiting an 
>   upgrade of
>   working copies with optional pristines.
>
>Thoughts?

A few thoughts:

First, Daniel Shahaf raises the question of whether there is 
really a problem here.  I.e., Why do we care about possible 
collisions when they're unlikely to happen in practice unless 
deliberately caused?

My answer is: we should care because it's very difficult to 
imagine all the consequences -- including but not limited to 
clever deliberate attacks -- that might follow from losing a 
property we formerly had.  The hash semantics we have always 
assumed are "If the file is modified, the hash will change."  When 
those semantics change, we don't need to be able to think 
immediately of a specific problematic scenario to know that this 
is a significant development.  We've lost the guarantee; that's 
enough to be worth worrying about.

BUT, if you want a scenario, here's one:

I have put WordPress installations under Subversion version 
control before.  Once, I detected an attack on one of those 
WordPress servers when one of the things the attacker did was 
modify some of the WordPress scripts on the server.  Those files 
showed up as modified when I ran 'svn st', and from there I ran 
'svn diff' and figured out what had happened.  But a super-careful 
attacker could make modifications that leave the 
version-controlled files with the same SHA1 hash they had before, 
thus making it harder to detect the attack.

Yes, I realize there are other ways to detect modifications, and 
that random attackers are unlikely to take the trouble to preserve 
hashes.  On the other hand, a well-resourced spear-fishing 
attacker who knows something about the usage of SVN at their 
target might indeed try a hash-preserving approach to breaking in. 
The point is, if we're counting on the hashes having certain 
semantics, then our users are counting on it too.  If SHA1 no 
longer has those semantics, we should upgrade.

Second, +1 to what Branko said: we should upgrade to a new hash 
when we upgrade a working copy anyway, but new clients should 
still be able to handle the old hash in old working copies without 
upgrading them.

Now, how hard would this be to actually implement?  The 
pristineless-format WC upgrade is an opportunity to make other 
format changes, but I'd hate to block the release of pristineless 
working copies on this...

Best regards,
-Karl

Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> > While here, I would like to raise a topic of incorporating a switch from
> > SHA1 to a different checksum type (without known collisions) for the new
> > working copy format.  This topic is relevant to the pristines-on-demand
> > branch, because the new "is the file modified?" check relies on the
> > checksum comparison, instead of comparing the contents of working and
> > pristine files.
> >
> > And so while I consider it to be out of the scope of the pristines-on-
> > demand branch, I think that we might want to evaluate if this is something
> > that should be a part of the next release.
>
> Good point.  Maybe worth a new thread?

[Moving discussion to a new thread]

We currently have a problem that a working copy relies on the checksum type
with known collisions (SHA1).  A solution to that problem is to switch to a
different checksum type without known collisions in one of the newer working
copy formats.

Since we plan on shipping a new working copy format in 1.15, this seems to
be an appropriate moment of time to decide whether we'd also want to switch
to a checksum type without known collisions in that new format.

Below are the arguments for including a switch to a different checksum type
in the working copy format for 1.15:

1) Since the "is the file modified?" check now compares checksums, leaving
   everything as-is may be considered a regression, because it would
   introduce additional cases where a working copy currently relies on
   comparing checksums with known collisions.

2) We already need a working copy format bump for the pristines-on-demand
   feature.  So using that format bump to solve the SHA1 issue might reduce
   the overall number of required bumps for users (assuming that we'll still
   need to switch from SHA1 at some point later).

3) While the pristines-on-demand feature is not released, upgrading with a
   switch to the new checksum type seems to be possible without requiring a
   network fetch.  But if some of the pristines are optional, we lose the
   possibility to rehash all contents in place.  So we might find ourselves
   having to choose between two worse alternatives of either requiring a
   network fetch during upgrade or entirely prohibiting an upgrade of
   working copies with optional pristines.

Thoughts?

Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 07 Dec 2022, Evgeny Kotkov wrote:
>Evgeny Kotkov <ev...@visualsvn.com> writes:
>I think that the `pristines-on-demand-on-mwf` branch is now ready 
>for a
>merge to trunk.  I could do that, assuming there are no 
>objections.

+1, and thank you.  

Now, I haven't had time to do a real code review -- my manager hat 
gets tighter every year -- so my "+1" is mainly a sign of 
enthusiasm for the feature, and of general trust in our test suite 
and in everyone who has worked on this.

>  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf
>
>The branch includes the following:
>– Core implementation of the new mode where required pristines 
>are fetched
>  at the beginning of the operation.
>– A new --store-pristine=yes/no option for `svn checkout` that is 
>persisted
>  as a working copy setting.

+1 to this UI.  We can offer other gateways to this feature later, 
but this is a clean & simple way to start out.

>– An update for `svn info` to display the value of this new 
>setting.

Yay.

>– A standalone test harness that tests main operations in both
>  --store-pristine modes and gets executed on every test run.
>– A new --store-pristine=yes/no option for the test suite that 
>forces all
>  tests to run with a specific pristine mode.

Very nice. 

>The branch passes all tests in my Windows and Linux environments, 
>in both
>--store-pristine=yes and =no modes.

W00t!

>While here, I would like to raise a topic of incorporating a 
>switch from
>SHA1 to a different checksum type (without known collisions) for 
>the new
>working copy format.  This topic is relevant to the 
>pristines-on-demand
>branch, because the new "is the file modified?" check relies on 
>the checksum
>comparison, instead of comparing the contents of working and 
>pristine files.
>
>And so while I consider it to be out of the scope of the 
>pristines-on-demand
>branch, I think that we might want to evaluate if this is 
>something that
>should be a part of the next release.

Good point.  Maybe worth a new thread?

Best regards,
-Karl

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> > IMHO, once the tests are ready, we could merge it and release
> > it to the world.
>
> Apart from the required test changes, there are some technical
> TODOs that remain from the initial patch and should be resolved.
> I'll try to handle them as well.

I think that the `pristines-on-demand-on-mwf` branch is now ready for a
merge to trunk.  I could do that, assuming there are no objections.

  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf

The branch includes the following:
– Core implementation of the new mode where required pristines are fetched
  at the beginning of the operation.
– A new --store-pristine=yes/no option for `svn checkout` that is persisted
  as a working copy setting.
– An update for `svn info` to display the value of this new setting.
– A standalone test harness that tests main operations in both
  --store-pristine modes and gets executed on every test run.
– A new --store-pristine=yes/no option for the test suite that forces all
  tests to run with a specific pristine mode.

The branch passes all tests in my Windows and Linux environments, in both
--store-pristine=yes and =no modes.

While here, I would like to raise a topic of incorporating a switch from
SHA1 to a different checksum type (without known collisions) for the new
working copy format.  This topic is relevant to the pristines-on-demand
branch, because the new "is the file modified?" check relies on the checksum
comparison, instead of comparing the contents of working and pristine files.

And so while I consider it to be out of the scope of the pristines-on-demand
branch, I think that we might want to evaluate if this is something that
should be a part of the next release.

Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 29 Nov 2022, Johan Corveleyn wrote:
>My thanks also to the courageous people having developed this, 
>and the
>gentle souls keeping the ball rolling :-).
>
>About the name:
>
>> [...]
>
>FWIW, my vote still goes to --store-pristines={yes|no}

Same here, FWIW.

I understand the argument that this exposes an "implementation 
detail" that the user is supposed to not need to think about.  But 
remember, the reason we developed this feature is because the user 
was *already* exposed to the existence of pristines: disk space 
usage by pristines is quite visible to the user -- that's the 
whole problem :-).

So only users who already "see" pristines -- that is, who are 
already aware of the storage issue -- would go looking for this 
feature in the first place.  So by the time they learn about the 
'--store-pristines' option, they're already being forced to deal 
with pristines as a concept, and the only question is whether the 
tool we give them to solve their problem will take advantage of 
that conceptual familiarity.

So, +1 to "--store-pristines=foo".

>I prefer such an explicit option here, rather than vague ones 
>that
>could cover many different things. Also, --optimize=X can easily 
>be
>interpreted inversely as intended (for instance: when I have an
>optimal network, do I use --optimize=network?)
>
>Apart from {yes|no} the feature might grow other option values in 
>the
>future ('size-based' or 'text-only', or maybe simply 'auto' if we 
>come
>up with a good general strategy that works for 99% of the cases, 
>the
>details of which we don't want to burden our users with). We 
>could
>even, in some distant future, allow user-defined names that are
>specified in ~/.subversion/config by the user (using some syntax 
>where
>the user can set configurable size limits or mime-types or 
>whatever).

I also agree with Johan's point here.

>One other suggestion: not a blocker of course, but a
>runtime-config-area default would be nice :-). Users might want 
>to
>choose the same option all the time, without having to remember 
>to add
>the option to their checkout command.
>
>Something like, in ~/.suversion/config
>
>store-pristines-default={yes|no}

Later on, this might grow into more sophisticated local run-time 
config regarding pristines, but for now, providing this basic 
yes/no default is a good idea.  For example, on machines where one 
is regularly checking out trees with huge files, one might set the 
default to "no".

Best regards,
-Karl

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Johan Corveleyn <jc...@gmail.com>.

My thanks also to the courageous people having developed this, and the
gentle souls keeping the ball rolling :-).

About the name:

On Thu, Nov 24, 2022 at 3:57 PM Nathan Hartman <ha...@gmail.com> wrote:
...
> Previously we got stuck trying to choose the user-facing name of this
> feature and its command line switches.
>
> Currently the CLI switch is --store-pristine={yes|no}.
>
> I'm okay with this, but for completeness I'll mention that earlier in
> the year there was a little bit of push back because pristines, up
> until now, have been an internal implementation detail that users
> needn't concern themselves with. (Except that they double the storage
> space...)
>
> I've been trying to think of something better for months now, and
> here's what I've come up with:
>
> --optimize=storage
> --optimize=network

FWIW, my vote still goes to --store-pristines={yes|no}

I prefer such an explicit option here, rather than vague ones that
could cover many different things. Also, --optimize=X can easily be
interpreted inversely as intended (for instance: when I have an
optimal network, do I use --optimize=network?)

Apart from {yes|no} the feature might grow other option values in the
future ('size-based' or 'text-only', or maybe simply 'auto' if we come
up with a good general strategy that works for 99% of the cases, the
details of which we don't want to burden our users with). We could
even, in some distant future, allow user-defined names that are
specified in ~/.subversion/config by the user (using some syntax where
the user can set configurable size limits or mime-types or whatever).

One other suggestion: not a blocker of course, but a
runtime-config-area default would be nice :-). Users might want to
choose the same option all the time, without having to remember to add
the option to their checkout command.

Something like, in ~/.suversion/config

store-pristines-default={yes|no}

Just my 2 cents of course ...
-- 
Johan

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Nathan Hartman <ha...@gmail.com>.

On Wed, Nov 23, 2022 at 9:53 AM Julian Foad <ju...@apache.org> wrote:
> Nathan, I see you replied enthusiastically and mentioned "I have much to
> say on both of these [TODOs] but I won't go into detail yet...". It
> seems to me it could be helpful to get that started sooner rather than
> later, too, if those issues still need hashing out.

Thanks for the nudge.

Previously we got stuck trying to choose the user-facing name of this
feature and its command line switches.

Currently the CLI switch is --store-pristine={yes|no}.

I'm okay with this, but for completeness I'll mention that earlier in
the year there was a little bit of push back because pristines, up
until now, have been an internal implementation detail that users
needn't concern themselves with. (Except that they double the storage
space...)

I've been trying to think of something better for months now, and
here's what I've come up with:

--optimize=storage
--optimize=network

Rationale:

* Self-documenting.

* Easy to explain: --optimize=storage saves storage space;
  --optimize=network reduces network accesses to the repository
  server.

* Users don't need to know about pristines. There aren't several levels
  of abstraction between the option name and why the user cares about
  it.

* Extensible. Maybe we can think of other ways to optimize for network
  bandwidth, for example.

The docs can give more user-facing explanation, including tradeoffs,
which SVN operations are affected, and example scenarios to help users
choose. It should be much easier to write -- and read -- than what we
currently have at the draft release notes [1].

As for example scenarios, while the original premise was to save space
on large files that don't change often, i525pod is also great in other
situations, such as checking out a large source tree on a ramdrive
(limited space), or on the same machine as the repo, or on a storage-
limited embedded device. (I've tried i525pod in all 3 of these
scenarios!)

Downsides:

* Admittedly, --optimize=network isn't the best name in all scenarios.
  Notably, this is a misnomer when the repository server is on the same
  machine as the working copy, but that might not matter because it's
  the default. (And I might suggest trying --optimize=storage in that
  scenario).

* If we ever want to do other cool things with pristines, such as an
  option to keep more locally cached history, these names won't be
  right for that.

* These option names haven't helped me come up with a better name for
  the feature itself.

There is an advantage to using --store-pristine={yes|no}: We don't need
to rename the feature because Pristines On Demand and the CLI options
are named similarly.

The disadvantage of --store-pristine={yes|no} is that the feature is
more burdensome for us to explain and for others to learn about,
especially from a non-technical standpoint. How would you explain this
feature in a press release, or in a short blurb (or dare I say, tweet)
about "What's new in Subversion 1.15?"

Some other possibilities that were discussed:

I'll mention these for completeness but note that if --optimize=x is
shot down, I'd rather use --store-pristine={yes|no} than any of these:

* Hydrate and dehydrate -- perhaps the terms that appear most in dev
  discussions. I don't recommend these in user-facing areas because
  they aren't self-documenting. Users can't deduce what these actually
  do for the user. Users might mistakenly think that their working
  files would be hydrated or dehydrated in some way. Users would have
  to learn about pristines to know what is being hydrated or
  dehydrated, eliminating any useful abstraction.

* "Bare working copies" -- the draft release notes [1] use this term
  tentatively to explain that "bare" working copies save storage by not
  caching "BASE" files. Unfortunately, "bare" and "BASE" differ by only
  one letter (and capitalization) and I feel like the explanation is
  too complicated and doesn't bring us closer to a good result.

* Briefly discussed: "local BASE" or "remote BASE" -- but that's a
  misnomer because there's no such thing as "remote" BASE.

Well, you've been warned that I have much to say. :-)

Cheers,
Nathan

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Julian Foad <ju...@apache.org>.

I'm glad to see you all picking up this project again. While working on
this at the beginning of the year I turned on the pristines-on-demand
mode in some of my own WCs such as my 'Documents' tree which includes
lots of scanned paper docs. It works nicely for cases like this, and
feels right, the pristine store being mostly unpopulated when the
working files are mostly unchanging.

I meant to check back with you during the year, how we should take it
forward. The recent summary in this thread sounds about right. My own
capacity to contribute is steadily decreasing. So, thank you, dev
community: it's good to see people working together to make it happen.
It would be pleasing to see this being brought to a satisfactory state
and released.

Nathan, I see you replied enthusiastically and mentioned "I have much to
say on both of these [TODOs] but I won't go into detail yet...". It
seems to me it could be helpful to get that started sooner rather than
later, too, if those issues still need hashing out.

- Julian

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 16 Nov 2022, Evgeny Kotkov wrote:
>Apart from the required test changes, there are some technical
>TODOs that remain from the initial patch and should be resolved.
>I'll try to handle them as well.

Thank you!

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> Thank you, Evgeny!  Just to make sure I understand correctly --
> the status now on the 'pristines-on-demand-on-mwf' branch is:
>
> 1) One can do 'svn checkout --store-pristines=no' to get an
> entirely pristine-less working copy.  In that working copy,
> individual files will be hydrated/dehydrated automagically on an
> as-needed basis.
>
> 2) There is no command to hydrate or dehydrate a particular file.
> Hydration and dehydration only happen as a side effect of other
> regular Subversion operations.
>
> 3) There is no way to rehydrate the entire working copy.  E.g.,
> something like 'svn update --store-pristines=yes' or 'svn hydrate
> --depth=infinity' does not exist yet.
>
> 4) Likewise, there is no way to dehydrate an existing working copy
> that currently has its pristines (even if that working copy is at
> a high-enough version format to support pristinelessness).  E.g.,
> something like 'svn update --store-pristines=no' or 'svn dehydrate
> --depth=infinity' does not exist yet.
>
> Is that all correct?

Yes, I believe that is correct.

> By the way, I do not think (2), (3), and (4) are blockers.  Just
> (1) by itself is a huge step forward and solves issue #525;

+1 on keeping the scope of the feature to just (1) for now.

> IMHO, once the tests are ready, we could merge it and release
> it to the world.

Apart from the required test changes, there are some technical
TODOs that remain from the initial patch and should be resolved.
I'll try to handle them as well.


Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Karl Fogel <kf...@red-bean.com>.

On 15 Nov 2022, Evgeny Kotkov wrote:
>Evgeny Kotkov <ev...@visualsvn.com> writes:
>
>> Perhaps we could transition into that state by committing the 
>> patch
>> and maybe re-evaluate things from there.  I could do that, 
>> assuming
>> no objections, of course.
>
>Committed the patch in https://svn.apache.org/r1905324
>
>I'll try to handle the related tasks in the near future.

Thank you, Evgeny!  Just to make sure I understand correctly -- 
the status now on the 'pristines-on-demand-on-mwf' branch is:

1) One can do 'svn checkout --store-pristines=no' to get an 
entirely pristine-less working copy.  In that working copy, 
individual files will be hydrated/dehydrated automagically on an 
as-needed basis.

2) There is no command to hydrate or dehydrate a particular file. 
Hydration and dehydration only happen as a side effect of other 
regular Subversion operations.

3) There is no way to rehydrate the entire working copy.  E.g., 
something like 'svn update --store-pristines=yes' or 'svn hydrate 
--depth=infinity' does not exist yet.

4) Likewise, there is no way to dehydrate an existing working copy 
that currently has its pristines (even if that working copy is at 
a high-enough version format to support pristinelessness).  E.g., 
something like 'svn update --store-pristines=no' or 'svn dehydrate 
--depth=infinity' does not exist yet.

Is that all correct?

By the way, I do not think (2), (3), and (4) are blockers.  Just 
(1) by itself is a huge step forward and solves issue #525; IMHO, 
once the tests are ready, we could merge it and release it to the 
world.

Best regards,
-Karl

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Evgeny Kotkov <ev...@visualsvn.com> writes:

> Perhaps we could transition into that state by committing the patch
> and maybe re-evaluate things from there.  I could do that, assuming
> no objections, of course.

Committed the patch in https://svn.apache.org/r1905324

I'll try to handle the related tasks in the near future.


Thanks,
Evgeny Kotkov

Re: Getting to first release of pristines-on-demand feature (#525).

Posted by Evgeny Kotkov via dev <de...@subversion.apache.org>.

Karl Fogel <kf...@red-bean.com> writes:

> By the way, in that thread, Evgeny Kotkov -- whose initial work
> much of this is based on -- follows up with a patch that does a
> first-pass implementation of 'svn checkout --store-pristines=no'
> (by implementing a new persistent setting in wc.db).

Perhaps we could transition into that state by committing the patch
and maybe re-evaluate things from there.  I could do that, assuming
no objections, of course.

Thanks,
Evgeny Kotkov