Posted to dev@subversion.apache.org by Julian Foad <ju...@apache.org> on 2022/03/10 21:27:43 UTC

Issue #525/#4892: on only fetching the pristines we really need

This is an investigation into changing the "pristines-on-demand"
approach to follow a principle that each operation would only fetch the
pristines it really needs.

I have begun a "user guide" ( notes/i525/i525-user-guide.md ), with the
aim of explaining the principle of operation of the current approach,
along with its expectations and limitations. Note well that the current
approach is based on a *different* principle from "each operation only
fetches the pristines it really needs".

As a reminder, the present design is based on a fetching paradigm that
is up-front and pessimistic: before any operation that *might* need
pristines, it ensures it has fetched sufficient (but perhaps more than
necessary) pristines. After that fetching phase (see
'svn_client__textbase_sync'), it then runs the original operation code
path, assured that the operation will run correctly in its existing
form, without needing to be modified to support fetching via a deep
(point of use) callback.
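
To make the paradigm concrete, here is a toy model of "sync up front, then
run the unmodified operation". It is only a sketch: sync_then_run and its
parameters are hypothetical names for illustration and do not reflect the
real signature of 'svn_client__textbase_sync'.

```python
from typing import Callable, Iterable, Set


def sync_then_run(might_need: Iterable[str],
                  store: Set[str],
                  fetch: Callable[[str], None],
                  operation: Callable[[], str]) -> str:
    """Hydrate every pristine the operation *might* need, then run it.

    'store' is the set of pristine checksums already present locally.
    'fetch' stands in for whatever contacts the repository (assumption).
    """
    for checksum in might_need:
        if checksum not in store:
            fetch(checksum)        # pessimistic: fetched even if never read
            store.add(checksum)
    # The operation itself is unchanged; it just finds the pristines present.
    return operation()
```

The point of the sketch is that all repository contact happens before
operation() is entered, so the operation needs no fetch callback of its own.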


Online vs offline operations
----------------------------

I want to draw a distinction, which may or may not help here, between
operations that were already "online" (required contacting the repo) and
those that were previously "offline" (local only).

The previously "online" operations include "update" of course, and
"switch" and "checkout --force" (both being sisters of update), and
"merge", and the forms of "diff" that compare base to repository.

Any online operation is going to connect to the repository anyway, in
its normal (previous) operation. When the current design deems that such
an operation needs to hydrate the pristines before it starts, this
"need" is more of a "uses in its current implementation". In principle
we could change its implementation to move the fetching of pristines
down the call stack to the point where it actually needs them, and so
ensure optimal fetching – in the sense of fetching only those it really
needs, and only when it really needs them.

This change would increase network traffic whenever a needed pristine is
missing, but that is only a quantitative change. Because these operations
are already online, it would not cause any substantial qualitative
difference to the user experience or to the high level client software's
need to handle repository connection and authentication.

Now contrast this with the previously "offline" operations.

If we change a previously "offline" operation (local diff, revert, etc.)
to fetch only the pristines it actually needs, by pushing the fetch
callbacks down the call stack to the point of use, that would lead to a
qualitatively different user experience and high level client software
usage pattern. (Previously discussed. In short: the callback and need
for authentication, which may require user input, may come at any point
after the operation has started, where for example a GUI tool may be in
the middle of displaying a series of file diffs.) I do not know how much
of an issue that might be, but some people have expressed concern.

Perhaps a useful compromise could be:

  - for the "online" operations only, fetch at the point of use
(optimal: only fetching the pristines they actually need); and
  - retain the pessimistic up-front sync paradigm for the "offline"
operations (so avoiding the callback awkwardness for them).

That's just for consideration, not a strong recommendation.
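
As a sketch of that compromise (hypothetical names throughout; nothing here
is an existing Subversion API), the choice of paradigm could be keyed on
whether the operation was already "online":

```python
from enum import Enum


class FetchPolicy(Enum):
    UP_FRONT = "sync pristines before the operation starts"
    POINT_OF_USE = "fetch each pristine where it is actually read"


# Operations listed above as already contacting the repository.
# The labels are illustrative strings, not real subcommand parsing.
ONLINE_OPERATIONS = {
    "update", "switch", "checkout --force", "merge", "diff base-vs-repo",
}


def fetch_policy(operation: str) -> FetchPolicy:
    """Online operations fetch lazily; offline ones keep the up-front sync."""
    if operation in ONLINE_OPERATIONS:
        return FetchPolicy.POINT_OF_USE
    return FetchPolicy.UP_FRONT
```

With this split, "revert" or a local "diff" would never trigger an
authentication prompt in the middle of its run.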

Now, let us take a look at "update" in particular, because it came up as
a problem in a primary use case that prompted me to file issue #4892.


Why and how does "update" currently require pristines?
------------------------------------------------------

Note that update involves TWO pristines for each file: the old one
that corresponds to the old base revision before the update, and the new
one that corresponds to the new base revision after the update.

Update currently uses pristines in two distinct ways:

  - [deltas] The update code reports the needed update in terms of a
delta against the (old) base revision, on the assumption that the client
has a pristine copy of the base revision. The repository duly sends such
a delta. The WC layer then attempts to apply the delta it receives, and
at that point attempts to open and read the old pristine, in order to
apply the delta to create the new pristine.

  - [restore] The update code also looks for files that are missing on
disk (if the 'restore_files' option is passed, which it usually is), and
restores them by reading and translating their pristines. It restores
files on the reporting side (in svn_wc_crawl_revisions5), before
reporting the state of each file.

What would it take to modify "update" to fetch at point of use?
---------------------------------------------------------------

For the Deltas:
---------------

The relevant sub-case is a file with local modifications. (For an
unmodified file it can reconstruct the pristine on the fly.)

If the working file has local modifications, then after the base is
updated, there is a 3-way merge to update the working file, which needs
to read both the old pristine and the new pristine.

Possible approach:

  - If the working file is *unmodified* and the pristine is missing,
on the reporting side, report that the current version is empty
(whatever the appropriate incantation is for that), to request the
server to send the whole new file (a.k.a. delta against empty). The
receiver (apply-delta) will then not need to read the old pristine, and
will store the result as the new pristine, as usual. No 3-way merge is
needed to update the working file; instead, translate the new pristine.

  - If the working file is *locally modified* and the pristine is
missing, on the reporting side, first fetch its current (old) pristine.
Then everything proceeds as before: report the current (old) base
revision, thereby asking the server to send a delta against that
pristine. That (old) pristine will be available for use in the 3-way merge.
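
The two cases above can be modelled as a small decision function. This is
only a sketch: FileState, plan_report and the "base"/"empty" labels are
hypothetical names for illustration, not the actual reporter API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FileState:
    locally_modified: bool
    pristine_present: bool


def plan_report(state: FileState) -> Tuple[bool, str]:
    """Return (fetch_old_pristine_first, what_to_report).

    "base"  = report the current base revision, so the server sends a delta.
    "empty" = report the file as empty, so the server sends the full text.
    """
    if state.pristine_present:
        return (False, "base")     # nothing missing: behave as before
    if state.locally_modified:
        return (True, "base")      # old pristine is needed for the 3-way merge
    return (False, "empty")        # unmodified: no old pristine needed at all
```

Only the "locally modified and pristine missing" case costs an extra fetch;
the unmodified case trades a delta for a full-text transfer instead.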

For the Restores:
-----------------

We would need to do this:

  - If a file needs to be restored and its pristine is missing, first
fetch it via callback.

  - Don't leave it in the pristine store afterwards, because by
definition this is a case where the file is unmodified. We might
implement this most simply as the sequence: fetch pristine, then
translate into working file, then clean up the pristine later. Or we
might want to optimise it into a single pass, streaming straight from
the repository through the translation into the working file, so there
is no time when disk space is needed for both the pristine copy and the
working copy simultaneously.

  - To be checked: For a file that ends up being updated later in the
update operation, it may be being restored unnecessarily at this step.
If that is the case, perhaps we can optimise by eliminating that. But
that seems to be an orthogonal optimisation, not dependent on i525.
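
A minimal sketch of the simple (non-streaming) sequence described above:
fetch, translate, then remove the pristine again. The names are hypothetical,
the store is modelled as a plain dict, and the clean-up happens immediately
here rather than "later" as it might in a real implementation.

```python
from typing import Callable, Dict


def restore_missing_file(checksum: str,
                         store: Dict[str, bytes],
                         fetch: Callable[[str], bytes],
                         translate: Callable[[bytes], bytes]) -> bytes:
    """Return working-file content, leaving the pristine store as it was."""
    fetched_here = checksum not in store
    if fetched_here:
        store[checksum] = fetch(checksum)   # via callback, only if missing
    try:
        # 'translate' stands in for keyword/EOL translation (assumption).
        return translate(store[checksum])
    finally:
        if fetched_here:
            # The restored file is unmodified, so its pristine is not kept.
            del store[checksum]
```

The single-pass variant would replace the store round-trip with a stream
from fetch straight into translate, so that the pristine never occupies disk
space at the same time as the working file.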


Conclusions:
------------

It is certainly possible that we could modify "update" and the other
"online" operations, at least, and the previously "offline" operations
too if we want, to make them fetch pristines at point-of-use in this way.

Such modifications are not trivial. There is the need to run additional
RA requests in between the existing ones, perhaps needing an additional
RA session to be established in parallel, or taking care with inserting
RA requests into an existing session. There is the boilerplate
version-bumping (revving) of the APIs to pass callbacks down to the
points of use. There is probably more.



Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
With a dive into the main "update" code, I was able to make "update"
fetch pristines at the point of use, and so minimally the ones it really
needs... I think.

So far I have only got it running with "restore" functionality disabled,
and run the test suite. I get the (ten) expected fails from the lack of
"restore", and some more. But not too many more. Just one in fact:

FAIL:  externals_tests.py 68: check file external recorded info

So this is looking promising.

Patch attached.

That's probably all for today.

- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
A quick dive in the "restore" code path led me to:

  - commit a small refactoring (r1898847 on trunk) to deduplicate the
code, which should be useful if we need to do anything like adding
callbacks to it;
  - observe that if we disable "restore", 10 tests fail (4 update tests
and 6 others).

The effort of adjusting test expectations and other fall-out (like
documentation updates) might not be smaller than the effort of
implementing the callback.

I will leave off looking further at the "restore" for now, and do some
further investigation in the main update code path.

- Julian


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Thinking about next steps. It seems worth investigating how feasible it is
to make at least "update" hydrate pristines at point-of-use (optimally/minimally).

"Restore" is an aberration
--------------------------

The first thought I had after sending that long post is that the
"restore" functionality of update is, to my mind, a historical
aberration. It's not consistent with anything: it's an intentional
revert of one of the user's local modifications, and we explicitly avoid
doing anything like that except in this one case. I don't know its
history except I believe I learnt it was copied from CVS. It would not
surprise me if in fact it was originally introduced to work around some
flaw rather than as an intentional user feature. (Anyone know... Karl?)

So, it would not be totally unreasonable to declare that we
intentionally omit that functionality when in pristines-on-demand mode,
at least for the first cut. There's one simplification.

"Merge"
-------

First, an addendum to what I wrote under "Deltas: working file is
*unmodified*": I described this step leaving the new pristine in the
pristine store. But we don't want to leave it in the pristine store
afterwards, because by definition this is a case where the file is
unmodified. Options, with their pros and cons, are as I described under
the "restore" case: either to avoid the store at that point or to let it
be stored for a moment and have the clean-up remove it toward the end of
the update operation.

Now, is there any shortcut we can make on the 3-way merge aspect of update?

The use case described in [1] is now (r1898846) committed on the branch
as 'notes/i525/i525-use-case-4892-minimal-update.txt'. In that use case,
the update brings in changes to small files that are locally modified;
but it does not bring in any changes to the huge files.

We want to avoid fetching the pristine of the (huge) locally modified
files which receive no update; while the update is still expected to
merge changes into the (small) locally modified files. (I am saying
"huge" and "small" to help us visualise the case; in our current short
term plans we don't expect the update operation to change its behaviour
depending on file size.)
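
To illustrate the difference in that use case, here is a toy comparison of
which pristines each approach would fetch. The function names and the use of
plain path sets are assumptions made for the sketch, not code from the branch.

```python
from typing import Set


def fetched_by_up_front_sync(modified: Set[str],
                             in_scope: Set[str],
                             hydrated: Set[str]) -> Set[str]:
    """Current design: every locally modified file within the update scope."""
    return (modified & in_scope) - hydrated


def fetched_at_point_of_use(modified: Set[str],
                            changed_by_update: Set[str],
                            hydrated: Set[str]) -> Set[str]:
    """Goal: only modified files that the update actually changes."""
    return (modified & changed_by_update) - hydrated


# Example: both files are locally modified; only the small one is updated.
modified = {"huge.bin", "small.txt"}
in_scope = {"huge.bin", "small.txt", "other.txt"}
changed_by_update = {"small.txt", "other.txt"}

up_front = fetched_by_up_front_sync(modified, in_scope, set())
minimal = fetched_at_point_of_use(modified, changed_by_update, set())
```

In this example the up-front sync fetches the pristines of both modified
files, while the point-of-use approach fetches only the one for small.txt.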

Before I speculate and write more on this, I will take a deeper swim in
the source code and see what more I can learn or hack together.

- Julian


[1] https://lists.apache.org/thread/t7y09576tz5xcqhwzqys3t0vfbdpg861 on
dev@ from Julian Foad at 2022-03-04T20:52:38Z.


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Thank you, Evgeny. That is exactly the kind of discussion we need, and
you were able to provide far more detailed insights than I was. That
should help us decide how to proceed.

As for your thoughts about the current approach for MVP, I tend to agree
that your approach is likely to be useful for a lot of people's use
cases. Unfortunately it has turned out that Karl has one of the use
cases that doesn't match those assumptions, where it doesn't work so well.

Karl, would it make sense now for you to search and compare use cases
that concern your group, and see if cases that don't work well with the
current design (such as the particular case we are focusing on right
now, r1898846 'notes/i525/i525-use-case-4892-minimal-update.txt') are of
majority or minority concern overall?

One thing we could do is to more formally document use cases in order
for potential users to spot which ones match their cases.

One thing we could do is release (preview) versions of both approaches and
get folks to evaluate them.

We need to decide how to spend the next effort here.

- Julian


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Wed, Mar 16, 2022 at 20:44:08 +0000:
> We're free to continue design discussions but I've limited time and
> need to focus. To me it appears we've moved far enough along this path
> of "some of our users want to do X" leading to "let's see how far we
> can implement an alternative" and now "let's consider the user's
> work-around options", and now "but the work-around has these
> consequences; mightn't that be a problem?". It seems to me we now know
> what are the two design directions, the original which is sub-optimal
> but near ready to use, and the alternative, now begun on its own
> "-issue4892" branch.

OK.

> I want to refrain from further speculation about how willing such
> a user would be to use the original design with work-arounds, and
> rather ask them first.

+1 with a caveat: User input shouldn't be the only factor in our decision
of whether to choose the workaround design.  It'd be useful information,
and it'd _support_ a particular course of action, but it wouldn't
_imply_ that particular course of action.

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@foad.me.uk>.
Daniel Shahaf wrote:
>Julian Foad wrote:
>> It might be *absolutely fine* for the real life users [...]
>
>So what are you saying?  That we should stop doing design discussions
>and go talk to users?  [...]

Sorry if that came across far too harsh. Looking back I see I phrased it as "What if...?" and your answer could be construed as a literal description of some technical consequences of such a work-around. What I meant was, what would those users feel about modifying their work flow with such a work-around? We don't know if any of those technical consequences would matter to them.

We're free to continue design discussions but I've limited time and need to focus. To me it appears we've moved far enough along this path of "some of our users want to do X" leading to "let's see how far we can implement an alternative" and now "let's consider the user's work-around options", and now "but the work-around has these consequences; mightn't that be a problem?". It seems to me we now know what are the two design directions, the original which is sub-optimal but near ready to use, and the alternative, now begun on its own "-issue4892" branch. I want to refrain from further speculation about how willing such a user would be to use the original design with work-arounds, and rather ask them first.


- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Wed, Mar 16, 2022 at 07:03:43 +0000:
> Daniel Shahaf wrote:
> >This implies the wc won't be uniform revision.  This might break user
> >expectations; might [...], 
> I'm not sure how your clarification helps us progress. The point is:
> 
> It might be *absolutely fine* for the real life users in their real life situations, and that's what we need to find out.

So what are you saying?  That we should stop doing design discussions
and go talk to users?  I agree we should talk to users, but I don't
think we can pass the design buck to them and go "Those users +1ed this
design so let's implement/release it".  There might be better ideas we
won't think of if we don't discuss things here on this list; the users
that talk to us are unlikely to be representative of all our users
anyway; and it's us, not them, who'll be promising to maintain that
design going forward.

Anyway, we can for starters call for testers on SVN-525 and on users@.

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@foad.me.uk>.
Daniel Shahaf wrote:
>This implies the wc won't be uniform revision.  This might break user
>expectations; might [...], 
I'm not sure how your clarification helps us progress. The point is:

It might be *absolutely fine* for the real life users in their real life situations, and that's what we need to find out.
- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Mon, Mar 14, 2022 at 10:47:57 +0000:
> I wonder if we are missing some perspective.
> 
> We are worried that the current design won't be acceptable because it
> has poor behaviour in a particular use case.
> 
> The use case involved running "svn update" at the root of the WC. (It
> didn't explicitly say that. More precisely, it implied the update target
> tree contains the huge locally modified file.)
> 
> Using this new feature necessarily requires some adjustments to user
> expectations and work flow.
> 
> What if we ask the user to limit their "svn update" to target the
> particular files/paths that they need to update, keeping their huge
> locally modified file out of its scope? Examples:
> 
> svn update readme.txt
> svn update small-docs/
> # BUT NOT: svn update the-whole-wc/
> 
> Then we side-step the issue. It only fetches pristines for modified
> files that are within the tree scope of the specified targets. (This is
> how it works already, not a proposal.)
> 
> OK that's not optimal but it might be sufficient.

This implies the wc won't be uniform revision.  This might break user
expectations; might prevent the user from running «svn merge»; and might
get in the way of querying the repository for the wc's history.  E.g.,
even a simple «svn diff -r BASE» on an unmodified wc will show
differences, because BASE is resolved to a revision number once at the
start, not once per versioned file.

One way in which I can see this possibly working is if the user is
willing to restructure their tree so all the large files are in one
subdirectory that's an immediate child of the wc root, and all other
files are in another immediate child directory of the wc root.  Then
they can use the latter child directory as their cwd and run svn
operations normally.  The cwd won't be the wc root, but that's
manageable.  However, requiring a tree restructure would increase the
cost to the user of starting to use this feature.

Also, at this point they can basically leave the large files in an
svn:external wc, and enable pristines-on-demand only for that wc, rather
than have one wc with two subdirs.

Cheers,

Daniel


> (Of course there are further concerns, such as what happens if the user
> starts an update at the WC root, then cancels it as it's taking too
> long: can we gracefully recover? Fine, we can look at those concerns.)
> 
> I can go ahead with further work on changing the design if required, but
> I am concerned that might not be the best use of resources. Also I don't
> know how to evaluate the balance of Evgeny's concerns about protocol
> level complexity of the alternative design, against the concerns about
> the present design. In other words pursuing that alternative seems
> riskier, while accepting the known down-sides of the current design is
> sub-optimal but seems less risky.
> 
> Should we first test the current design and see if we can work with it,
> before going full steam ahead into changing the design?
> 
> The current design/implementation (on branch
> 'pristines-on-demand-on-mwf') is in a working state. There are open
> issues that still need to be resolved, but it's complete enough to be
> ready for this level of testing.
> 
> - Julian
> 

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Karl Fogel <kf...@red-bean.com>.
Hi -- I've just absorbed all of this thread that was new since the 
last time I read it (that's a lot! :-) ).

I agree with Julian's judgement that we should just ship the MVP 
version of our issue #525 solution with 'svn update' fetching 
pristines for locally-modified files.

While it's not ideal, it's also not a showstopper, and I don't 
think it's worth increasing time-to-ship for this feature over 
this relatively minor point.

1) It's "just" a performance issue, not a correctness issue.

2) It can be improved in the future.

3) For users who would be bitten by it, there are several easy
   workarounds available: target one's updates to avoid stimulating
   the fetch-pristine behavior; or, copy a file locally before
   operating on it; or, sequence one's work so as to only have one
   modified file around at a given time.

The 525-enabled Subversion will still be a huge win, even with 
this small blemish.

Many thanks to Evgeny for pointing out the complexities, likewise 
to Julian for his very patient explanations and re-explanations. 
And I'd like to personally thank Mark for fighting the good fight 
:-).  It sounds like we have consensus that this is an 
implementation-driven behavior not a usage-driven behavior, so 
if/when one day we go to fix it at least we'll already have 
agreement on that as a goal.

Best regards,
-Karl

On 14 Mar 2022, Julian Foad wrote:
>Dear dev community, and especially Karl and Mark:
>
>A plea to test the current design/implementation.
>
>I wonder if we are missing some perspective.
>
>We are worried that the current design won't be acceptable because it
>has poor behaviour in a particular use case.
>
>The use case involved running "svn update" at the root of the WC. (It
>didn't explicitly say that. More precisely, it implied the update target
>tree contains the huge locally modified file.)
>
>Using this new feature necessarily requires some adjustments to user
>expectations and work flow.
>
>What if we ask the user to limit their "svn update" to target the
>particular files/paths that they need to update, keeping their huge
>locally modified file out of its scope? Examples:
>
>svn update readme.txt
>svn update small-docs/
># BUT NOT: svn update the-whole-wc/
>
>Then we side-step the issue. It only fetches pristines for modified
>files that are within the tree scope of the specified targets. (This is
>how it works already, not a proposal.)
>
>OK that's not optimal but it might be sufficient.
>
>(Of course there are further concerns, such as what happens if the user
>starts an update at the WC root, then cancels it as it's taking too
>long: can we gracefully recover? Fine, we can look at those concerns.)
>
>I can go ahead with further work on changing the design if required, but
>I am concerned that might not be the best use of resources. Also I don't
>know how to evaluate the balance of Evgeny's concerns about protocol
>level complexity of the alternative design, against the concerns about
>the present design. In other words pursuing that alternative seems
>riskier, while accepting the known down-sides of the current design is
>sub-optimal but seems less risky.
>
>Should we first test the current design and see if we can work with it,
>before going full steam ahead into changing the design?
>
>The current design/implementation (on branch
>'pristines-on-demand-on-mwf') is in a working state. There are open
>issues that still need to be resolved, but it's complete enough to be
>ready for this level of testing.
>
>- Julian
 
On 14 Mar 2022, Mark Phippard wrote:
>On Mon, Mar 14, 2022 at 6:48 AM Julian Foad <ju...@apache.org> wrote:
>>
>> Dear dev community, and especially Karl and Mark:
>>
>> A plea to test the current design/implementation.
>>
>> I wonder if we are missing some perspective.
>
>Hi Julian,
>
>I do not believe I can offer much in the way of testing, but I do want
>to reiterate that I am not objecting to the current state of the
>change. At least not in the veto sense.
>
>My work has taken me away from SVN. I just wanted to bring some user
>perspective into the conversation. I realize there may be
>considerations that make it not the best option to try to solve this.
>I will have to leave that up to you ... and Karl if he has helped fund
>this effort.
>
>Karl did a great job explaining why the current behavior will be
>unexpected to the user. I agree they may have workarounds they can use
>to manage it. How a user feels about those workarounds is hard to
>predict.
>
>Good luck
>
>Mark
 
On 14 Mar 2022, Julian Foad wrote:
>Mark Phippard wrote:
>> [...] I just wanted to bring some user perspective [...]
>Thanks, Mark. Understood.
>
>I also want to clarify that this is my pragmatic side speaking. For
>anyone who doesn't remember me saying this before, I'll repeat that my
>purist side would love to see us do the alternative, optimal, design,
>and work through the consequent protocol issues. That seems obviously
>to me more "Right", but is only useful if we could afford to follow it
>through to completion, and we don't have a good estimate of the effort
>required for that.
>
>- Julian
 
On 15 Mar 2022, Julian Foad wrote:
>Just an addendum, perhaps a more positive portrayal of the brief
>exploration of the alternative design approach: my assessment is that
>exploration was enough to show that an initial release based on the
>original approach has possibilities of being improved, incrementally, in
>that way, as and when resources permit.
>
>In other words I am not recommending choosing one approach and
>abandoning the other, but starting with one and postponing the other as
>possible future improvement work.
>
>- Julian
 
On 11 Mar 2022, Julian Foad wrote:
>Thank you, Evgeny. That is exactly the kind of discussion we need, and
>you were able to provide far more detailed insights than I was. That
>should help us decide how to proceed.
>
>As for your thoughts about the current approach for MVP, I tend to agree
>that your approach is likely to be useful for a lot of people's use
>cases. Unfortunately it has turned out that Karl has one of the use
>cases that doesn't match those assumptions, where it doesn't work so well.
>
>Karl, would it make sense now for you to search and compare use cases
>that concern your group, and see if cases that don't work well with the
>current design (such as the particular case we are focusing on right
>now, r1898846 'notes/i525/i525-use-case-4892-minimal-update.txt') are of
>majority or minority concern overall?
>
>One thing we could do is to more formally document use cases in order
>for potential users to spot which ones match their cases.
>
>One thing we could do is release (preview) versions of both approaches and
>get folks to evaluate them.
>
>We need to decide how to spend the next effort here.
>
>- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Shahaf wrote on Wed, Mar 16, 2022 at 04:43:19 +0000:
> Julian Foad wrote on Mon, Mar 14, 2022 at 20:23:29 +0000:
> > Daniel Sahlberg wrote:
> > >[...] I will try to build a release for myself and use it for dev work.
> > Thank you Daniel.
> > 
> > I'm wondering if I (or we) need to do more to facilitate evaluation. I'm thinking of things like adding some feedback to tell the user what it's doing ("fetching missing pristines now..."), maybe at an extra verbose level during this evaluation phase to help users understand it; finding out if any of the outstanding issues need fixing in order to be able to use it productively; maybe getting binaries built and distributed if that helps; maybe we can supply more succinct user documentation than what I wrote so far?
> 
> I think what we need now is users willing and able to test this.  Once
> we do, we can figure out what we need to do in order to make it easier
> for them to test it, whether it's write docs, or add notifications, or
> build binaries, or…
> 
> For starters, ourselves.  Is HEAD of the branch good enough that devs
> with use-cases can start to try it in their real use-case wc's?  It
> won't be possible to downgrade f32 to f31, but if we want, say, to make
> pristines-on-demand toggleable within a format,

(That's SVN-4889.)

>                                                 we can implement that in
> f33 and leave f32 as "never appeared in a release".
> 
> I'll also mention asciinema.  It's basically script(1) into a video
> hosted online.  It might be instructive for us to watch an asciinema
> session of someone trying this branch for the first time.  It's about as
> near as we can get to standing behind their shoulder, without actually
> sharing a machine with them and watching a shared tmux(1) session.
> 
> Cheers,
> 
> Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Wed, Mar 16, 2022 at 07:27:49 +0000:
> Daniel Shahaf wrote:
> >I'll also mention asciinema.  It's basically script(1) into a video
> >hosted online.  It might be instructive for us to watch an asciinema
> >session of someone trying this branch for the first time.  It's about as
> >near as we can get to standing behind their shoulder, without actually
> >sharing a machine with them and watching a shared tmux(1) session.
> 
> Good idea. Anyone willing to try it?

asciinema is packaged in various distros, so you should be able to
install it via your package manager.  See https://repology.org/project/asciinema/versions


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Daniel Shahaf wrote:
> [...], ourselves.  Is HEAD of the branch good enough that devs
>with use-cases can start to try it in their real use-case wc's?  It
>won't be possible to downgrade f32 to f31, but [...]

I think so. Anyone trying it should take a quick look through the known bugs filed as dependent on #525, and be aware that with any WC implementation bugs there's the possibility of getting the WC metadata into a stuck or corrupt state. That said, it's good enough for testing.

>I'll also mention asciinema.  It's basically script(1) into a video
>hosted online.  It might be instructive for us to watch an asciinema
>session of someone trying this branch for the first time.  It's about as
>near as we can get to standing behind their shoulder, without actually
>sharing a machine with them and watching a shared tmux(1) session.

Good idea. Anyone willing to try it?

- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Mon, Mar 14, 2022 at 20:23:29 +0000:
> Daniel Sahlberg wrote:
> >[...] I will try to build a release for myself and use it for dev work.
> Thank you Daniel.
> 
> I'm wondering if I (or we) need to do more to facilitate evaluation. I'm thinking of things like adding some feedback to tell the user what it's doing ("fetching missing pristines now..."), maybe at an extra verbose level during this evaluation phase to help users understand it; finding out if any of the outstanding issues need fixing in order to be able to use it productively; maybe getting binaries built and distributed if that helps; maybe we can supply more succinct user documentation than what I wrote so far?

I think what we need now is users willing and able to test this.  Once
we do, we can figure out what we need to do in order to make it easier
for them to test it, whether it's write docs, or add notifications, or
build binaries, or…

For starters, ourselves.  Is HEAD of the branch good enough that devs
with use-cases can start to try it in their real use-case wc's?  It
won't be possible to downgrade f32 to f31, but if we want, say, to make
pristines-on-demand toggleable within a format, we can implement that in
f33 and leave f32 as "never appeared in a release".

I'll also mention asciinema.  It's basically script(1) into a video
hosted online.  It might be instructive for us to watch an asciinema
session of someone trying this branch for the first time.  It's about as
near as we can get to standing behind their shoulder, without actually
sharing a machine with them and watching a shared tmux(1) session.

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Daniel Sahlberg wrote:
>[...] I will try to build a release for myself and use it for dev work.
Thank you Daniel.

I'm wondering if I (or we) need to do more to facilitate evaluation. I'm thinking of things like adding some feedback to tell the user what it's doing ("fetching missing pristines now..."), maybe at an extra verbose level during this evaluation phase to help users understand it; finding out if any of the outstanding issues need fixing in order to be able to use it productively; maybe getting binaries built and distributed if that helps; maybe we can supply more succinct user documentation than what I wrote so far?
- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Sahlberg <da...@gmail.com>.
Den mån 14 mars 2022 kl 14:00 skrev Julian Foad <ju...@apache.org>:

> Mark Phippard wrote:
> > [...] I just wanted to bring some user perspective [...]
> Thanks, Mark. Understood.
>
> I also want to clarify that this is my pragmatic side speaking. For anyone
> who doesn't remember me saying this before, I'll repeat that my purist side
> would love to see us do the alternative, optimal, design, and work through
> the consequent protocol issues. That seems obviously to me more "Right",
> but is only useful if we could afford to follow it through to completion,
> and we don't have a good estimate of the effort required for that.


I fully support both your effort and your judgement that we mustn't let
perfect stand in the way of getting a first implementation out of the door.
I've been vocal regarding not shutting some doors for the future but that
doesn't mean I believe them to be critical.

I have a few other things I'm looking at right now but I will try to build
a release for myself and use it for dev work.

Kind regards,
Daniel Sahlberg

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Mark Phippard wrote:
> [...] I just wanted to bring some user perspective [...]
Thanks, Mark. Understood.

I also want to clarify that this is my pragmatic side speaking. For anyone who doesn't remember me saying this before, I'll repeat that my purist side would love to see us do the alternative, optimal, design, and work through the consequent protocol issues. That seems obviously to me more "Right", but is only useful if we could afford to follow it through to completion, and we don't have a good estimate of the effort required for that.

- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Mark Phippard <ma...@gmail.com>.
On Mon, Mar 14, 2022 at 6:48 AM Julian Foad <ju...@apache.org> wrote:
>
> Dear dev community, and especially Karl and Mark:
>
> A plea to test the current design/implementation.
>
> I wonder if we are missing some perspective.

Hi Julian,

I do not believe I can offer much in the way of testing, but I do want
to reiterate that I am not objecting to the current state of the
change. At least not in the veto sense.

My work has taken me away from SVN. I just wanted to bring some user
perspective into the conversation. I realize there may be
considerations that make it not the best option to try to solve this.
I will have to leave that up to you ... and Karl if he has helped fund
this effort.

Karl did a great job explaining why the current behavior will be
unexpected to the user. I agree they may have workarounds they can use
to manage it. How a user feels about those workarounds is hard to
predict.

Good luck

Mark

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Karl Fogel <kf...@red-bean.com>.
On 18 Mar 2022, Nathan Hartman wrote:
>tl;dr: Pretty darn good for a first cut!

W00t!

>The versioned contents are source files; not exactly a huge WC by
>today's standards but believe it or not this makes a big difference
>for me, as I often operate on WCs this size directly on embedded
>systems whose storage is somewhat constrained, and also on ramdrives,
>where it's nice to cut usage in half.

Oh, interesting -- I hadn't thought of that use case at all.

>One thing I noticed: With i525pod enabled, svn does not delete empty
>.svn/pristine/??/ subdirectories when no longer needed. (Not sure what
>the correct terminology is for those 2-digit-hex-code dirs.) It does
>purge the pristine files within these subdirectories, just not the
>subdirectories themselves. They even survive a
>'svn cleanup --vacuum-pristines'. I haven't yet looked at the code to
>see whether non-i525pod SVN ever deletes these or not, or if there's a
>simple fix. I don't consider this a showstopper or a terribly big deal
>but unless this is expected behavior, I'd like to at least file an
>issue for it.

Totally not a showstopper, agreed; I'm not even positive it's 
really a problem?  But it makes sense to have a ticket for it 
until we decide.

>Summary: Overall, I'm impressed so far. Thanks for everyone's
>contributions, whether code, input, or financial, for making this
>possible!

♥

-K

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Nathan Hartman <ha...@gmail.com>.
On Mon, Mar 14, 2022 at 6:48 AM Julian Foad <ju...@apache.org> wrote:
>
> Dear dev community, and especially Karl and Mark:
>
> A plea to test the current design/implementation.

Feedback so far on the 'pristines-on-demand-on-mwf' branch as of
r1898990:

tl;dr: Pretty darn good for a first cut!

I merged pristines-on-demand-on-mwf to trunk, built, and tried it on a
real workload. Admittedly I did not run the test suite (though I plan
to). Instead, every step along the way, I compared the 1.15+i525pod WC
directory contents (excluding .svn) to those of a parallel 1.14 WC
checked out and operated upon by a 1.14 client and things seem to be
working correctly, at least insofar as actual files and their contents
are concerned. Repo access was via "svn://" (svnserve).

As of r1898990 it merged cleanly to trunk and built successfully (on
Debian Linux).

The "real workload" mentioned consisted of a working copy containing
several svn:externals; I checked that out, but most operations were
done within one of the nested WCs, which contains the majority of the
data.

That WC contains a total of 19161 versioned files.

When checked out without i525pod: Total 38141 files including
pristines occupying 512M.

When checked out with i525pod: Total 19166 files occupying 270M, and
SVN correctly set the parent and nested WCs to f32 format.

The versioned contents are source files; not exactly a huge WC by
today's standards but believe it or not this makes a big difference
for me, as I often operate on WCs this size directly on embedded
systems whose storage is somewhat constrained, and also on ramdrives,
where it's nice to cut usage in half.
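
For reference, the arithmetic behind "cut usage in half" can be worked out from the figures reported above. This is only a small Python sketch over those reported numbers; the variable names are made up for illustration and the unit "M" is taken as given in the message.

```python
# Figures as reported in the message above (sizes in "M" as given there).
without_pod = {"files": 38141, "size_m": 512}
with_pod = {"files": 19166, "size_m": 270}
versioned_files = 19161

# Disk space saved by not storing pristines for unmodified files.
saved_m = without_pod["size_m"] - with_pod["size_m"]
saved_pct = 100.0 * saved_m / without_pod["size_m"]

# File-count differences.
fewer_files = without_pod["files"] - with_pod["files"]
beyond_versioned_with_pod = with_pod["files"] - versioned_files

print(f"space saved: {saved_m}M ({saved_pct:.1f}%)")
print(f"fewer files on disk: {fewer_files}")
print(f"files beyond the versioned ones with i525pod: {beyond_versioned_with_pod}")
```

So the saving is a little under half of the original footprint, and with i525pod only a handful of files exist beyond the versioned ones (the message does not itemize them).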

I performed operations: checkout, info, switch, log, patch, update
(including restore), cleanup, merge, diff, status, commit.

I performed several commits. To verify that nothing became hosed here,
I then performed a fresh checkout with a 1.14 client and compared its
contents to the expected known state, which was the contents of the
1.15+i525pod WC. Also I ran 'svnadmin verify' on the repo (which is on
a separate machine) and it verified successfully. (Since the commits
succeeded, I didn't expect otherwise.)

Speed of client operations didn't feel noticeably slower; maybe just a
little bit for some operations for which pristines had to be fetched,
but as the files in question were not huge, it was not very impactful.
I understand that this will probably be more noticeable when the WC
contains huge files that are modified but don't have a convenient way
to test that at the moment.

I used the plaintext credential cache and was not prompted for
passwords at any time. (I should check with a credential method that
does result in prompts and follow up... I understand it may currently
prompt twice for one invocation.)

One thing I noticed: With i525pod enabled, svn does not delete empty
.svn/pristine/??/ subdirectories when no longer needed. (Not sure what
the correct terminology is for those 2-digit-hex-code dirs.) It does
purge the pristine files within these subdirectories, just not the
subdirectories themselves. They even survive a
'svn cleanup --vacuum-pristines'. I haven't yet looked at the code to
see whether non-i525pod SVN ever deletes these or not, or if there's a
simple fix. I don't consider this a showstopper or a terribly big deal
but unless this is expected behavior, I'd like to at least file an
issue for it.
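
For anyone wanting to reproduce the observation, a minimal Python sketch along these lines could list the empty two-character subdirectories under .svn/pristine. It only inspects the file system and is not Subversion code; the function name is hypothetical, and the demo layout is fabricated in a temporary directory rather than taken from a real working copy.

```python
import os
import tempfile


def empty_pristine_shards(wc_root):
    """Return the names of empty subdirectories directly under .svn/pristine."""
    pristine_dir = os.path.join(wc_root, ".svn", "pristine")
    if not os.path.isdir(pristine_dir):
        return []
    empty = []
    for name in sorted(os.listdir(pristine_dir)):
        shard = os.path.join(pristine_dir, name)
        # A shard counts as empty when it is a directory with no entries.
        if os.path.isdir(shard) and not os.listdir(shard):
            empty.append(name)
    return empty


# Demo on a fabricated layout: one empty shard, one shard holding a file.
with tempfile.TemporaryDirectory() as wc:
    os.makedirs(os.path.join(wc, ".svn", "pristine", "0a"))
    os.makedirs(os.path.join(wc, ".svn", "pristine", "ff"))
    with open(os.path.join(wc, ".svn", "pristine", "ff", "demo.svn-base"), "w"):
        pass
    print(empty_pristine_shards(wc))
```

Running it against a real working copy root before and after 'svn cleanup --vacuum-pristines' would show whether the empty directories persist.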

More below inline...

> We are worried that the current design won't be acceptable because it
> has poor behaviour in a particular use case.
>
> The use case involved running "svn update" at the root of the WC. (It
> didn't explicitly say that. More precisely, it implied the update target
> tree contains the huge locally modified file.)

As I said above, I didn't have a convenient way to test this use case
but it would be informative to conjure one up (or wait for feedback
from users with such a use case).

> Using this new feature necessarily requires some adjustments to user
> expectations and work flow.
>
> What if we ask the user to limit their "svn update" to target the
> particular files/paths that they need to update, keeping their huge
> locally modified file out of its scope? Examples:
>
> svn update readme.txt
> svn update small-docs/
> # BUT NOT: svn update the-whole-wc/
>
> Then we side-step the issue. It only fetches pristines for modified
> files that are within the tree scope of the specified targets. (This is
> how it works already, not a proposal.)
>
> OK that's not optimal but it might be sufficient.

We could suggest that as a workaround in the release notes and point
out that this could be optimized gradually in future releases.

> (Of course there are further concerns, such as what happens if the user
> starts an update at the WC root, then cancels it as it's taking too
> long: can we gracefully recover? Fine, we can look at those concerns.)
>
> I can go ahead with further work on changing the design if required, but
> I am concerned that might not be the best use of resources. Also I don't
> know how to evaluate the balance of Evgeny's concerns about protocol
> level complexity of the alternative design, against the concerns about
> the present design. In other words pursuing that alternative seems
> riskier, while accepting the known down-sides of the current design is
> sub-optimal but seems less risky.
>
> Should we first test the current design and see if we can work with it,
> before going full steam ahead into changing the design?
>
> The current design/implementation (on branch
> 'pristines-on-demand-on-mwf') is in a working state. There are open
> issues that still need to be resolved, but it's complete enough to be
> ready for this level of testing.

I think we should test the current design/implementation first; I do
have the following suggestions, if feasible:

1. We should allow the i525pod feature to be enabled/disabled
   separately from updating WC to f32, even if such enabling/disabling
   can occur only at 'checkout' time for MVP, as i525pod should not be
   mandatory moving forward due to user-affecting tradeoffs such as
   some formerly client-only ops now requiring server access.

2. The BRANCH-README suggests to take advantage of the format bump to
   switch to a better checksum than SHA-1. I agree this does make
   sense and could provide users an additional reason to consider
   upgrading, but that increases the importance of item #1 above and
   may detract from the focus of this feature; on the upside, this may
   reduce future effort and would save a format bump.

Summary: Overall, I'm impressed so far. Thanks for everyone's
contributions, whether code, input, or financial, for making this
possible!

Cheers,
Nathan

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Wed, Mar 16, 2022 at 21:03:28 +0000:
> Daniel Shahaf wrote:
> >Also, unrelated: have we verified that all the temporary files we create
> >are created in a crash-safe way?  I.e., that if libsvn_wc is SIGKILL'd
> >partway through hydrating something, the something will be cleaned up by
> >libsvn_wc at some point in the future?
> 
> I haven't reviewed for that. Could you perhaps record it somewhere more find-able?

https://issues.apache.org/jira/browse/SVN-4896

> Not sure if it pertains to this issue thread or the whole i525.

The latter.  Sorry for the misunderstanding.

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Daniel Shahaf wrote:
>Also, unrelated: have we verified that all the temporary files we create
>are created in a crash-safe way?  I.e., that if libsvn_wc is SIGKILL'd
>partway through hydrating something, the something will be cleaned up by
>libsvn_wc at some point in the future?

I haven't reviewed for that. Could you perhaps record it somewhere more find-able? Not sure if it pertains to this issue thread or the whole i525.

- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Wed, Mar 16, 2022 at 06:52:48 +0000:
> Daniel Shahaf wrote:
> >Julian Foad wrote:
> >> exploration was enough to show that an initial release based on the
> >> original approach has possibilities of being improved, incrementally, in
> >> that way, as and when resources permit.
> >> 
> >> In other words I am not recommending choosing one approach and
> >> abandoning the other, but starting with one and postponing the other as
> >> possible future improvement work.
> >
> >Sorry, but could you spell out what are the "one approach" and "the
> >other"?  Are you proposing to release the code as it is, fetching in
> >advance, and saying you're confident it can in the future be taught to
> >fetch during the operation, notwithstanding kotkov@'s points about
> >RA-level timeouts?
> Yes; while uncertain how much effort it might require to overcome the concerns such as RA-level timeouts.

Sounds good.

Also, unrelated: have we verified that all the temporary files we create
are created in a crash-safe way?  I.e., that if libsvn_wc is SIGKILL'd
partway through hydrating something, the something will be cleaned up by
libsvn_wc at some point in the future?
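
To make the question concrete, here is a generic Python sketch of the usual pattern: write to a temporary file in the destination directory, rename it into place atomically, and have a later cleanup pass remove leftovers from a killed process. This is an illustration of the general technique only; it is an assumption, not a description of what libsvn_wc does, and the function names and the ".tmp" suffix are hypothetical.

```python
import os
import tempfile


def install_atomically(dest_path, data):
    """Write data to a temp file next to dest_path, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Atomic on the same file system: readers see the old or the new file.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Reached on ordinary errors; a SIGKILL never gets here, which is
        # why a separate cleanup pass is needed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def remove_stale_temp_files(directory, suffix=".tmp"):
    """Cleanup pass: delete leftover temp files and return their names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(suffix):
            os.unlink(os.path.join(directory, name))
            removed.append(name)
    return removed
```

The point of the second function is that a process killed between creating the temp file and renaming it leaves debris that only some later operation can remove.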

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Daniel Shahaf wrote:
>Julian Foad wrote:
>> exploration was enough to show that an initial release based on the
>> original approach has possibilities of being improved, incrementally, in
>> that way, as and when resources permit.
>> 
>> In other words I am not recommending choosing one approach and
>> abandoning the other, but starting with one and postponing the other as
>> possible future improvement work.
>
>Sorry, but could you spell out what are the "one approach" and "the
>other"?  Are you proposing to release the code as it is, fetching in
>advance, and saying you're confident it can in the future be taught to
>fetch during the operation, notwithstanding kotkov@'s points about
>RA-level timeouts?
Yes; while uncertain how much effort it might require to overcome the concerns such as RA-level timeouts.
- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Tue, Mar 15, 2022 at 20:10:24 +0000:
> Just an addendum, perhaps a more positive portrayal of the brief
> exploration of the alternative design approach: my assessment is that
> exploration was enough to show that an initial release based on the
> original approach has possibilities of being improved, incrementally, in
> that way, as and when resources permit.
> 
> In other words I am not recommending choosing one approach and
> abandoning the other, but starting with one and postponing the other as
> possible future improvement work.

Sorry, but could you spell out what are the "one approach" and "the
other"?  Are you proposing to release the code as it is, fetching in
advance, and saying you're confident it can in the future be taught to
fetch during the operation, notwithstanding kotkov@'s points about
RA-level timeouts?

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Just an addendum, perhaps a more positive portrayal of the brief
exploration of the alternative design approach: my assessment is that
exploration was enough to show that an initial release based on the
original approach has possibilities of being improved, incrementally, in
that way, as and when resources permit.

In other words I am not recommending choosing one approach and
abandoning the other, but starting with one and postponing the other as
possible future improvement work.

- Julian


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
The patch I sent is now committed in r1898948 as a new branch,
'pristines-on-demand-on-issue4892', for easier dev/test access.

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Johan Corveleyn wrote:
>Well, as I said, I don't have huge binaries myself [...]. That's just me speculating of course [...]
Hi, Johan. I really appreciate you taking the time to write all your thoughts. That kind of speculation has its place and was certainly useful earlier, but we're now at a point where it could be counter-productive. I think we now need to focus on real use cases and real testing.
- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Mar 14, 2022 at 5:26 PM Julian Foad <ju...@apache.org> wrote:
>
> Johan Corveleyn wrote:
> >Speaking from the peanut gallery, [...]
> >If I would be a user with several huge binaries in the repo / WC, I
> >imagine I would not be happy with this proposal. The reason is that I
> >have always, forever, only done "svn update the-whole-wc". Updating
> >individual subdirs is micro-managing. [...]
>
> But do you locally modify those huge binaries?

Well, as I said, I don't have huge binaries myself (nor do my
colleagues, for that matter). We have a single build server with 400
working copies, each with 100000's of "normal sized" files. But
apparently some users, like Karl (or Karl's customer), do have huge
binaries in their WC, and I suppose sometimes they will be modified.

All I can say is: individual-directory-updating is not something I've
often witnessed being used (unless maybe for a short ad-hoc purpose,
not on a regular basis). With lots of things going on, and lots of
things being worked on at the same time, the usual workflow I know is
"update entire WC to keep up with others; start working (or continue
where you left off yesterday)". So if I would locally modify a huge binary
(it might be left there for days, weeks, ... locally modified, until
I'm happy for it to be committed), every time I update I would update
the entire WC. That's just me speculating of course ... maybe I should
leave the floor for Karl and others ...

Just one more thing: if people "micro-manage" their working copy, in
my experience they will usually do this once, by setting up an
appropriately arranged sparse working copy (leaving out for example
the huge binaries of other teams they're not interested in, or source
modules that are not important for their work). After this one-time
setup (now and then fine-tuned to exclude or add another directory or
file), people just "svn update entire-sparse-wc". It makes no sense to
me to update only subdirectories, at least not on a regular basis.
That's too much fiddly work every time (and you can't save that
"to-be-updated selection" for next time), and you might end up with
inconsistent stuff locally (for instance a library not being updated
together with its callers). But again, I might have wrong assumptions
here, since I don't work with huge binaries in SVN myself.

-- 
Johan

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Johan Corveleyn wrote:
>Speaking from the peanut gallery, [...]
>If I would be a user with several huge binaries in the repo / WC, I
>imagine I would not be happy with this proposal. The reason is that I
>have always, forever, only done "svn update the-whole-wc". Updating
>individual subdirs is micro-managing. [...]
But do you locally modify those huge binaries?
- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Mar 14, 2022 at 11:48 AM Julian Foad <ju...@apache.org> wrote:
>
> Dear dev community, and especially Karl and Mark:
>
> A plea to test the current design/implementation.
>
> I wonder if we are missing some perspective.
>
> We are worried that the current design won't be acceptable because it
> has poor behaviour in a particular use case.
>
> The use case involved running "svn update" at the root of the WC. (It
> didn't explicitly say that. More precisely, it implied the update target
> tree contains the huge locally modified file.)
>
> Using this new feature necessarily requires some adjustments to user
> expectations and work flow.
>
> What if we ask the user to limit their "svn update" to target the
> particular files/paths that they need to update, keeping their huge
> locally modified file out of its scope? Examples:
>
> svn update readme.txt
> svn update small-docs/
> # BUT NOT: svn update the-whole-wc/
>
> Then we side-step the issue. It only fetches pristines for modified
> files that are within the tree scope of the specified targets. (This is
> how it works already, not a proposal.)
>
> OK that's not optimal but it might be sufficient.

Speaking from the peanut gallery, as I or "my users" probably won't be
bothered too much by the occasional "superfluous pristine" (as I said
before, I'm more interested in this feature for our build server with
400 working copies -- not particularly for rare huge files, but more
for the general overhead of duplicating thousands of files):

If I would be a user with several huge binaries in the repo / WC, I
imagine I would not be happy with this proposal. The reason is that I
have always, forever, only done "svn update the-whole-wc". Updating
individual subdirs is micro-managing. I update my entire project (i.e.
working copy) daily just to keep up with whatever my 100 colleagues
have done since the last time I updated. It's just daily routine:
Press Ctrl-T (Update Project) in IntelliJ, and start working. I'm not
going to bother thinking about what exactly I need updated, and
multi-selecting the 80 subdirs that I might need to keep in sync with
the most important stuff.

That's just my opinion, or my feeling trying to place myself in the
position of such users. I'm not doing any actual work here, so I
certainly don't want to stand in the way of consensus / progress, if
the current implementation is sufficient for others.

-- 
Johan

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Dear dev community, and especially Karl and Mark:

A plea to test the current design/implementation.

I wonder if we are missing some perspective.

We are worried that the current design won't be acceptable because it
has poor behaviour in a particular use case.

The use case involved running "svn update" at the root of the WC. (It
didn't explicitly say that. More precisely, it implied the update target
tree contains the huge locally modified file.)

Using this new feature necessarily requires some adjustments to user
expectations and work flow.

What if we ask the user to limit their "svn update" to target the
particular files/paths that they need to update, keeping their huge
locally modified file out of its scope? Examples:

svn update readme.txt
svn update small-docs/
# BUT NOT: svn update the-whole-wc/

Then we side-step the issue. It only fetches pristines for modified
files that are within the tree scope of the specified targets. (This is
how it works already, not a proposal.)

OK that's not optimal but it might be sufficient.
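
The scope rule described above can be illustrated with a minimal Python sketch. It only models the selection logic (which modified files fall inside the given update targets); the function name and the example paths other than readme.txt and small-docs/ are hypothetical, and it is not the code on the branch.

```python
from pathlib import PurePosixPath


def pristines_to_fetch(modified_without_pristine, targets):
    """Select the modified files that lie within any of the update targets."""
    target_paths = [PurePosixPath(t) for t in targets]
    selected = []
    for f in modified_without_pristine:
        p = PurePosixPath(f)
        # In scope if the file is a target itself or sits below a target dir.
        if any(p == t or t in p.parents for t in target_paths):
            selected.append(f)
    return selected


# Hypothetical WC state: three locally modified files lacking pristines.
modified = ["assets/huge.bin", "small-docs/intro.txt", "readme.txt"]

print(pristines_to_fetch(modified, ["readme.txt", "small-docs/"]))
print(pristines_to_fetch(modified, ["."]))
```

With the narrow targets the huge file stays out of scope; with the WC root as target all three files are selected, which is the case the thread is worried about.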

(Of course there are further concerns, such as what happens if the user
starts an update at the WC root, then cancels it as it's taking too
long: can we gracefully recover? Fine, we can look at those concerns.)

I can go ahead with further work on changing the design if required, but
I am concerned that might not be the best use of resources. Also I don't
know how to evaluate the balance of Evgeny's concerns about protocol
level complexity of the alternative design, against the concerns about
the present design. In other words pursuing that alternative seems
riskier, while accepting the known down-sides of the current design is
sub-optimal but seems less risky.

Should we first test the current design and see if we can work with it,
before going full steam ahead into changing the design?

The current design/implementation (on branch
'pristines-on-demand-on-mwf') is in a working state. There are open
issues that still need to be resolved, but it's complete enough to be
ready for this level of testing.

- Julian


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
"Restore" isn't involved in our use cases. The only reason I mentioned the "restore" functionality in the first place is that my proof-of-concept patch deliberately leaves it broken: it's non-core functionality that will also need modifying to make it work in the new way if we proceed that way. I don't expect it to be particularly difficult to fix; no more difficult than modifying the main update functionality. I wish I hadn't mentioned the possibility that we might choose to ignore this minor sub-feature; it's just distracting us from discussing the main functionality.

Understanding the current issue about pristines and updates can be achieved while completely ignoring "restore".
- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Mark Phippard <ma...@gmail.com>.
On Sun, Mar 13, 2022 at 3:32 AM Johan Corveleyn <jc...@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 9:17 PM Nathan Hartman <ha...@gmail.com> wrote:
> > If possible and not overly burdensome, I think it would be a good
> > thing to keep the "restore" functionality for the following reasons:
> [snip]
>
> I agree. I know about the restore feature too, and am used to it.
> Also, I think it would be a mistake to create different behaviour
> between "normal (pristine-full)" and "pristines-on-demand" working
> copies.

I have also used this feature of update before.

What still feels weird is that the feature does not seem relevant to
the situation. The file in the WC is NOT missing in this scenario so
there is no file to restore. The pristine file is the one that is
missing and it is just not clear why the WC code would even be looking
for this scenario.

Anyway, thanks for taking the time to explain it and I understand that
despite my confusion and questions it is a difficult problem to solve.

Mark

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Mar 11, 2022 at 9:17 PM Nathan Hartman <ha...@gmail.com> wrote:
> If possible and not overly burdensome, I think it would be a good
> thing to keep the "restore" functionality for the following reasons:
[snip]

I agree. I know about the restore feature too, and am used to it.
Also, I think it would be a mistake to create different behaviour
between "normal (pristine-full)" and "pristines-on-demand" working
copies.

On "update's essential functionality": the main problem, I feel, is
that the current implementation fetches missing pristines for modified
files *even if they will not be updated*. I understand that this
happens because, at the point of "textbase-sync" (check for and
download missing pristines, which happens before the operation goes
into details), it doesn't know yet which files will effectively
receive an update. So this really feels like a waste (fetching stuff
which will not even be needed, and indeed in many use cases like
Karl's often won't be).

But you all know that already, I guess, and I appreciate all efforts
in trying to come up with creative approaches to avoid this "waste"
:-).

Don't know if this is realistic or useful, but:
Apart from pushing this textbase-sync closer to the point of access,
one could imagine that during textbase-sync the client would already
have more information about what will be updated later on. For example
by organizing the client-server exchange so that the client first
receives a report, without texts, about what will be updated, then
fetching missing pristines if needed, and then going on to request the
actual update texts. IIRC, this kind of exchange is already the
default since 1.8 with serf (except if certain directives / config
flags are set) [1]. Then again, it might be quite awkward to implement
this kind of flow only for one particular variation of our
client-server protocol.

[1] https://subversion.apache.org/docs/release-notes/1.8.html#serf-skelta-default
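
The report-first idea above can be sketched in a few lines of Python. This models only the selection step between receiving the report and requesting texts; the function name and the data are hypothetical, and nothing here reflects an existing Subversion API or the serf implementation.

```python
def pristines_needed(modified_without_pristine, incoming_paths):
    """Keep only the modified files that the report says will be updated."""
    incoming = set(incoming_paths)
    return [p for p in modified_without_pristine if p in incoming]


# Hypothetical data for one update.
modified_without_pristine = ["assets/huge.bin", "src/main.c"]
# Phase 1: report of paths the server will change, received without texts.
update_report = ["src/main.c", "src/util.c"]

# Phase 2: fetch pristines only where a local mod meets an incoming change.
to_fetch = pristines_needed(modified_without_pristine, update_report)
print(to_fetch)
# Phase 3 (not modelled): request the actual update texts.
```

In this example the up-front approach would fetch pristines for both modified files, while the report-first flow fetches only the one for src/main.c.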

-- 
Johan

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Nathan Hartman <ha...@gmail.com>.
On Fri, Mar 11, 2022 at 2:36 PM Julian Foad <ju...@apache.org> wrote:
> The "restore missing files" is an odd thing that "update" does outside
> of its core purpose: when any versioned file in its scope is missing
> from disk, it puts an unmodified copy of the file back on disk there, by
> copying (and translating keywords/eol-style) from its pristine version.
> It's something like an implicit revert for files that were deleted from
> disk without using "svn delete". Why so, I don't know, and I don't like
> it, but it's there, so let's go on to look at how it interacts with
> absent pristines.
(snip)
> It's probably a rare case in most users' work flows but it's something
> that exists we have to deal with one way or another, even if by
> declaring it no longer supported and changing test expectations accordingly.

I am glad that you explained the "restore thing." That is actually a
workflow that I use, probably more often than I should! I don't
remember whether it's an old habit dating back to CVS, whether it was
ever documented in the svn-book, or whether I simply observed that
deleting a file and calling 'svn update' restores it, but this feature
is kind of ingrained in my mind as part of how SVN works. I do use
'svn revert' when appropriate, but sometimes 'svn update' seems more
appropriate. Yes, this may be more of a habit than a real use case,
but bear with me for a moment...

If possible and not overly burdensome, I think it would be a good
thing to keep the "restore" functionality for the following reasons:

1. This is how SVN has worked since ages and ages ago and may be
ingrained in the minds of other users besides me; if so, then getting
rid of it may be considered a backwards compatibility breaking change
in terms of usage.

2. Conceptually the restore functionality makes sense, since it brings
a working copy up to date with the repository; any files in the
repository which are not in the working copy should therefore be
populated.

3. "svn checkout" is the same thing as bootstrapping a working copy
without the contents and then running "svn update" to "restore" the
missing files, which are missing because they were never downloaded in
the first place. The code in svn_client__checkout_internal() explains
this in comments.

4. An interrupted "checkout" or "update" (interrupted, e.g., with
Ctrl+C) can be resumed with "svn update" (after running "svn cleanup"
to release locks).

Removing the "restore" functionality will probably break the above;
unfortunately I don't know what else may be affected.

Just my 2 cents. If it absolutely has to go, I'll get used to using
'revert' instead, but I just wanted to point out that this may cause
other issues...

Cheers,
Nathan

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Johan Corveleyn wrote:
>It's not specific to 'svn update' per se, but it's logical that it
>leads to this discussion, because it is a (commonly used) case where
>the pristine is not actually needed for the operation (if there is no
>actual incoming update to the concerned file). 'svn diff' and 'svn
>revert' cannot do their work without the pristine, but 'update without
>an actual incoming edit'?
>
>And even with an 'incoming edit on update (on top of local mod)' it
>might in theory be possible to delay the download of the full pristine
>until after conflict-resolution decision (but I imagine that's even
>more difficult to untangle).
>
>Also, why should 'svn update' be in the business of (silently)
>restoring "the branch's invariant" (even when it does not need the
>file), and not any other operation (like 'svn status -u' for example)?

Thanks for adding all this rationale. +1 to it all. 100% agreed.


- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Johan Corveleyn <jc...@gmail.com>.
On Wed, Mar 16, 2022 at 9:59 PM Julian Foad <ju...@apache.org> wrote:
> Daniel Shahaf wrote:
> > Also, why is this specific to «svn update»?
>
> It's not specific to update. Update is a particular case that Karl cares about so I looked at doing "update" first. Implementing this approach in one subcommand at a time could be considered releasable incremental steps, because each one is a further optimisation.

It's not specific to 'svn update' per se, but it's logical that it
leads to this discussion, because it is a (commonly used) case where
the pristine is not actually needed for the operation (if there is no
actual incoming update to the concerned file). 'svn diff' and 'svn
revert' cannot do their work without the pristine, but 'update without
an actual incoming edit'?

And even with an 'incoming edit on update (on top of local mod)' it
might in theory be possible to delay the download of the full pristine
until after conflict-resolution decision (but I imagine that's even
more difficult to untangle).

Also, why should 'svn update' be in the business of (silently)
restoring "the branch's invariant" (even when it does not need the
file), and not any other operation (like 'svn status -u' for example)?

-- 
Johan

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Daniel Shahaf wrote:
>> The request is to break the original design's invariant for this case.
>
> By only hydrating files that have been updated repository-side.  How
> will small, modified files that _haven't_ been remotely modified get
> hydrated, then?  The logic is the same for small and large files, IIUC.

They will get hydrated by any operation that needs them to be, such as diff and revert.

> Also, why is this specific to «svn update»? 

It's not specific to update. Update is a particular case that Karl cares about so I looked at doing "update" first. Implementing this approach in one subcommand at a time could be considered releasable incremental steps, because each one is a further optimisation.

> It seems to apply equally
> well to «svn diff» without further arguments, since the "large" files
> are presumed to be undiffable, 

That's not presumed. It's a rule of thumb that large files are often undiffable, but as yet there has been no plan to omit them from a requested diff by default. That could be a future sub-feature.

> If the issue does apply not only to 'update' but also to 'diff', that
> suggests we should look for a solution that applies to both of them
> (e.g., exclude "large" files from being recursed into by default, or
> make it so "large" files _never_ get hydrated).

That is a possible alternative to consider. Perhaps even a good one worth filing and discussing further.


- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Wed, Mar 16, 2022 at 19:49:38 +0000:
> Daniel Shahaf wrote:
> > [...]I suspect I'm still missing something.
> 
> I suggest you re-read the issue 4892 use case: https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-issue4892/notes/i525/i525-use-case-4892-minimal-update.txt
> 
> The request is to break the original design's invariant for this case.

By only hydrating files that have been updated repository-side.  How
will small, modified files that _haven't_ been remotely modified get
hydrated, then?  The logic is the same for small and large files, IIUC.

Also, why is this specific to «svn update»?  It seems to apply equally
well to «svn diff» without further arguments, since the "large" files
are presumed to be undiffable, but the issue, the notes, and the OP of
this thread all treat «svn update» as a _sui generis_ case.

If the issue does apply not only to 'update' but also to 'diff', that
suggests we should look for a solution that applies to both of them
(e.g., exclude "large" files from being recursed into by default, or
make it so "large" files _never_ get hydrated).

Sorry, I feel like I must be asking questions that must have already
been discussed, but I _have_ read the threads and I still don't know the
answers to these.

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Daniel Shahaf wrote:
> [...]I suspect I'm still missing something.

I suggest you re-read the issue 4892 use case: https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-issue4892/notes/i525/i525-use-case-4892-minimal-update.txt

The request is to break the original design's invariant for this case.

- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
> >> [...] next a similar pattern applies to the "normal" part of the
> >> update (everything it does after "restore"). Obviously we need the
> >> normal part of update
> >
> >Yes, but for the "deltas" part of update we already mostly DTRT, don't we?
> >
> >- If the file is not modified, [...]
> >
> >- If the file is locally modified, then by design, we need to end up
> >  with a pristine for it.  Right now we'll download BASE, and then
> >  [...]  What am I missing?
> 
> You're missing the case where the file is locally modified, and is in
> the tree scope of the update request, but no update is found in the
> repo. Currently we download its base before executing the business
> logic of update, so before we know that we're not going to need the
> base to complete this update request.

But that's exactly the branch's invariant, isn't it?

[[[
The core idea is that we start to maintain the following
invariant: only the modified files have their pristine text-base
files available on the disk.
]]]

So, if the file is locally modified, and we download its base, we cause
the file to meet the invariant.  I don't see how that's a problem,
unless we download a base we already have, or discard the base rather
than keep it.

I suspect I'm still missing something.

Cheers,

Daniel

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@foad.me.uk>.
Daniel Shahaf wrote:
>> Stick with the idea, for now, that we do need to handle that "restore"
>> part of update.
>
>Can we deprecate it?

People already argued for keeping it. No need to spend more time discussing that now, as I pointed out the effort required to make it work this new way (fetch at point of use, for want of a better term) doesn't look large.

>> [...] next a similar pattern applies to the "normal" part of the
>> update (everything it does after "restore"). Obviously we need the
>> normal part of update
>
>Yes, but for the "deltas" part of update we already mostly DTRT, don't we?
>
>- If the file is not modified, [...]
>
>- If the file is locally modified, then by design, we need to end up
>  with a pristine for it.  Right now we'll download BASE, and then
>  [...]  What am I missing?

You're missing the case where the file is locally modified, and is in the tree scope of the update request, but no update is found in the repo. Currently we download its base before executing the business logic of update, so before we know that we're not going to need the base to complete this update request.


- Julian

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Julian Foad wrote on Fri, Mar 11, 2022 at 19:36:41 +0000:
> Stick with the idea, for now, that we do need to handle that "restore"
> part of update.

Can we deprecate it?

In the API, create an svn_client_updateN() function that's documented to
be like svn_client_updateN-1() but without reverting absent files.  In
the CLI, create an «svn update2» command with the same caveat.  Tell
people who use pristines-on-demand to use «svn update2» rather than «svn
update».  Would this work?

> The alternative approach is to add a callback to the
> lower level "get the pristine" function that it uses. Then it would be
> able to fetch what it needs when it needs it and not fetch anything it
> doesn't need. Evgeny cautions us about that alternative approach, but
> *in principle* if we could get the protocol level behaviour absolutely
> right, that would (I think) surely be better.
> 
> That "restore" happens, within update, before the server tells us
> whether any update of that particular file is actually present on the
> repository. (Perhaps it could be moved to afterwards; I haven't
> investigated that possibility.)
> 
> Then, never mind whether we care about supporting that "restore" thing;
> because next a similar pattern applies to the "normal" part of the
> update (everything it does after "restore"). Obviously we need the
> normal part of update

Yes, but for the "deltas" part of update we already mostly DTRT, don't we?

- If the file is not modified, the WORKING file doubles as an on-demand
  BASE (it gets detranslated when BASE is called for) so hydrating it is
  a no-op.

- If the file is locally modified, then by design, we need to end up
  with a pristine for it.  Right now we'll download BASE, and then
  download a delta to the new BASE and do a 3-way merge.  Can we avoid
  downloading either the old BASE or the delta?
  
  There are three possible cases (cf. merge_file_trivial()):

  + the old and new BASEs are equal, in which case the delta download
    and application is O(1), and the BASE download is fine because
    a modified file is _supposed_ to have a pristine; or

  + WORKING and the new BASE are equal, in which case we will
    download stuff we don't need to (but this is an edge case); or
    
  + we need to run a three-way merge, which means we need the 
    .rN .rM .mine all handy.  We have only one file, so we need the
    server to send us the other two, and the question is just whether
    the delta from .rN to .rM would be applied client-side or
    server-side.  The .rN .rM .mine files might be merged by us with
    a 'G' notification, or we might give up and throw a 'C' notification,
    but we can't avoid downloading two of the three files — unless
    someone knows an rsync-like way to do diff3 on three files when two
    of them are not locally available.

    Users can avoid this case by using svn:needs-lock on these large
    files, to ensure the old and new BASEs will be equal.  Cf.
    https://mail-archives.apache.org/mod_mbox/subversion-dev/202201.mbox/%3C20220131115758.GA14771%40tarpaulin.shahaf.local2%3E
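
The three cases above can be sketched as a small classifier. This is
only an illustration of the case analysis, not the logic of
merge_file_trivial(); the function name, the enum and the use of
checksum strings as inputs are all assumptions made for the example.

```python
from enum import Enum


class MergeCase(Enum):
    NO_INCOMING_CHANGE = "old BASE == new BASE"
    WORKING_EQUALS_NEW_BASE = "WORKING == new BASE"
    THREE_WAY_MERGE = "all three differ where it matters"


def classify_update(old_base: str, new_base: str, working: str) -> MergeCase:
    """Classify an update of a locally modified file.

    The arguments are hypothetical content checksums of the old BASE,
    the new BASE and the WORKING file.
    """
    if old_base == new_base:
        # Nothing incoming: the delta is empty.
        return MergeCase.NO_INCOMING_CHANGE
    if working == new_base:
        # The local modification already matches the incoming text.
        return MergeCase.WORKING_EQUALS_NEW_BASE
    # Otherwise .rN, .rM and .mine are all needed for a real merge.
    return MergeCase.THREE_WAY_MERGE


if __name__ == "__main__":
    print(classify_update("r10", "r10", "mine"))
    print(classify_update("r10", "r11", "r11"))
    print(classify_update("r10", "r11", "mine"))
```

Only the last case unavoidably needs two texts from the server; the
first two are where a download could in principle be skipped.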

So, it seems to me we're doing what we can except in the case of an
update that would make the new BASE identical to WORKING; and should
recommend that users consider svn:needs-lock.  What am I missing?

> > [...] From my layman's point of view, we have a database in
> > the WC that says what we have. I assumed we largely would be using
> > that information when talking to the server about what it needs to
> > send us to do an update.
> 
> Well, yes and no in the current two-phase (hydration, then operation) approach.
> 
> > So I am just not getting why the server needs
> > to send us a file that WC already has.
> 
> It doesn't ever send us a file that the WC already has. The issue we're
> concerned about is it sending a pristine that is (knowingly) absent from
> the pristine store, but for a file whose pristine is not going to be
> looked at during the current update. It might be needed by some future
> operation but the current approach fetched it pre-emptively (by design)
> but for this use case we would rather not do that.
> 
> > I know your answer will be the "restore situation" [...]
> 
> That's not the essential part, no; the main part of "update" is
> obviously the essential part, and it has similar characteristics and
> options, and isn't optional.
> 
> - Julian
> 

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.
Mark Phippard wrote:
> Is there a way to describe it in a way that a really experienced SVN
> "user" not "developer" would understand? Set aside the pristineless WC
> feature. What is the scenario in previous versions of SVN where this
> behavior is kicking in?

Hi, Mark. I have been mainly aiming my writing to those taking a
detailed look and reviewing the code, while at the same time I don't
want to leave anyone here, even on the periphery of the development
community, in the dark. I'll give it one more shot in this subthread.

The current approach adds a "textbase sync" phase before every
operation, which crudely and pessimistically fetches pristines for all
locally modified files within the tree scope of the operation's target
paths, if they are (knowingly) absent from the pristine store. This
approach then invokes the old business logic of the operation, with
few modifications inside it, in particular no callback to fetch
pristines on demand after this point.
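
To make the difference between the two approaches concrete, here is a
minimal sketch. It is not Subversion code: the WcFile record and both
function names are hypothetical, and it ignores the "restore" case
discussed next. It only shows which pristines each strategy would fetch
for an update.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WcFile:
    """Hypothetical summary of one versioned file in the update scope."""
    path: str
    locally_modified: bool
    pristine_present: bool
    incoming_change: bool  # only known once the server has reported


def textbase_sync_set(files: Iterable[WcFile]) -> List[str]:
    """Up-front, pessimistic: every modified file lacking a pristine."""
    return [f.path for f in files
            if f.locally_modified and not f.pristine_present]


def point_of_use_set(files: Iterable[WcFile]) -> List[str]:
    """Need-based: only those that also receive an incoming change."""
    return [f.path for f in files
            if f.locally_modified and not f.pristine_present
            and f.incoming_change]


if __name__ == "__main__":
    scope = [
        WcFile("big.bin", True, False, False),   # modified, no update in repo
        WcFile("doc.txt", True, False, True),    # modified, update incoming
        WcFile("clean.c", False, False, True),   # unmodified
    ]
    print(textbase_sync_set(scope))  # ['big.bin', 'doc.txt']
    print(point_of_use_set(scope))   # ['doc.txt']
```

The file "big.bin" is the case of interest: it is fetched by the
up-front phase even though this particular update never reads it.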

The "restore missing files" is an odd thing that "update" does outside
of its core purpose: when any versioned file in its scope is missing
from disk, it puts an unmodified copy of the file back on disk there, by
copying (and translating keywords/eol-style) from its pristine version.
It's something like an implicit revert for files that were deleted from
disk without using "svn delete". Why so, I don't know, and I don't like
it, but it's there, so let's go on to look at how it interacts with
absent pristines.

If we have pristines absent on purpose (as we do in this branch), and we
simply disable the "textbase sync" a.k.a. "hydration" that runs at the
start of an update in our current approach, then that "restore" part of
the update errors out (unable to read the pristine) when it encounters a
file missing on disk.

It's probably a rare case in most users' work flows but it's something
that exists and that we have to deal with one way or another, even if by
declaring it no longer supported and changing test expectations accordingly.

Stick with the idea, for now, that we do need to handle that "restore"
part of update. The alternative approach is to add a callback to the
lower level "get the pristine" function that it uses. Then it would be
able to fetch what it needs when it needs it and not fetch anything it
doesn't need. Evgeny cautions us about that alternative approach, but
*in principle* if we could get the protocol level behaviour absolutely
right, that would (I think) surely be better.
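
A minimal sketch of that callback idea, under stated assumptions: the
PristineStore class, the PristineMissing exception and the fetch
callback signature are all invented for illustration and do not
correspond to the real libsvn_wc API. It leaves out entirely the
protocol-level difficulties that Evgeny cautions about.

```python
from typing import Callable, Dict, Optional


class PristineMissing(Exception):
    """Raised when a pristine is absent and no fetch callback is given."""


class PristineStore:
    """Hypothetical in-memory pristine store keyed by checksum."""

    def __init__(self) -> None:
        self._texts: Dict[str, bytes] = {}

    def get(self, checksum: str,
            fetch: Optional[Callable[[str], bytes]] = None) -> bytes:
        if checksum in self._texts:
            return self._texts[checksum]
        if fetch is None:
            # Without a callback, an absent pristine is an error, as
            # happens to "restore" when hydration is simply disabled.
            raise PristineMissing(checksum)
        # Fetch at the point of use, then keep the text for later calls.
        text = fetch(checksum)
        self._texts[checksum] = text
        return text


if __name__ == "__main__":
    store = PristineStore()
    print(store.get("abc123", fetch=lambda c: b"pristine text"))
    print(store.get("abc123"))  # now served from the store
```

With this shape, a caller such as "restore" would fetch exactly the
pristines it reads and nothing else.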

That "restore" happens, within update, before the server tells us
whether any update of that particular file is actually present on the
repository. (Perhaps it could be moved to afterwards; I haven't
investigated that possibility.)

Then, never mind whether we care about supporting that "restore" thing;
because next a similar pattern applies to the "normal" part of the
update (everything it does after "restore"). Obviously we need the
normal part of update

> [...] From my layman's point of view, we have a database in
> the WC that says what we have. I assumed we largely would be using
> that information when talking to the server about what it needs to
> send us to do an update.

Well, yes and no in the current two-phase (hydration, then operation) approach.

> So I am just not getting why the server needs
> to send us a file that WC already has.

It doesn't ever send us a file that the WC already has. The issue we're
concerned about is it sending a pristine that is (knowingly) absent from
the pristine store, but for a file whose pristine is not going to be
looked at during the current update. It might be needed by some future
operation, and the current approach fetches it pre-emptively (by design),
but for this use case we would rather not do that.

> I know your answer will be the "restore situation" [...]

That's not the essential part, no; the main part of "update" is
obviously the essential part, and it has similar characteristics and
options, and isn't optional.

- Julian


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Mark Phippard <ma...@gmail.com>.
On Fri, Mar 11, 2022 at 12:21 PM Julian Foad <ju...@apache.org> wrote:

> > This is where the question comes in ... why does not having the
> > pristines change this? The WC still knows what files it has and what
> > revisions. Isn't this what drives the process? I just do not
> > understand what has changed that forces the server to send the client
> > the version of a file that the client already knows it has.
>
> My best recent explanation I can offer was in my reply about 5 messages
> ago, at 10:36 UTC, with subsections "Restore is an aberration" and
> "Merge". Not sure what exactly you're missing; maybe not seeing the
> overall difference between the two approaches that are currently being compared?

I think it is because that message was "meaningless" to me. What I
mean, is that I have never heard this "restore" term before so I
really just did not understand what you are talking about and why not
having the pristine triggers this behavior.

Is there a way to describe it in a way that a really experienced SVN
"user" not "developer" would understand? Set aside the pristineless WC
feature. What is the scenario in previous versions of SVN where this
behavior is kicking in?

To put it all another way, I am guessing I just did not understand why
it has to trigger all of these complications that Evgeny described. I
trust both of you that these are real; I am just finding it hard to
understand why. From my layman's point of view, we have a database in
the WC that says what we have. I assumed we largely would be using
that information when talking to the server about what it needs to
send us to do an update. So I am just not getting why the server needs
to send us a file that WC already has.

I know your answer will be the "restore situation" but again I have
been around since 2003 and I have no idea what you are talking about.
I do not recall this being something I have heard before so I assume
it is some low level detail in the code that just washed over me in
the past.

As I originally said, explaining this to me is not going to fix
anything so I am also OK with leaving me in the dark. I value your
time.

Mark

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Julian Foad <ju...@apache.org>.

On Mar 11 2022, at 5:07 pm, Mark Phippard <ma...@gmail.com> wrote:

> On Fri, Mar 11, 2022 at 10:24 AM Evgeny Kotkov
> <ev...@visualsvn.com> wrote:
>> 
>> Julian Foad <ju...@apache.org> writes:
>> 
>> > Conclusions:
>> > ------------
>> >
>> > It is certainly possible that we could modify "update" and the other
>> > "online" operations, at least, and the previously "offline" operations
>> > too if we want, to make them fetch pristines at point-of-use in
>> this way.
>> >
>> > Such modifications are not trivial. There is the need to run additional
>> > RA requests in between the existing ones, perhaps needing an additional
>> > RA session to be established in parallel, or taking care with inserting
>> > RA requests into an existing session.
>> 
>> I think that this part has a lot of protocol constraints and hidden complexity.
>> And things could probably get even more complex for merge and diff.
>> 
>> Consider a bulk update report over HTTP, which is just a single
>> response that
>> has to be consumed in a streamy fashion.
> 
> Feel free to ignore this question because it is not going to help us
> get to a solution, but there is something about all of this I do not
> understand.
> 
> Today, in normal SVN, if I do svn update my assumption is the client
> and server negotiate a fairly efficient reply. The server only sends
> the client the data it needs to update.  [...]

Correct.

> This is where the question comes in ... why does not having the
> pristines change this? The WC still knows what files it has and what
> revisions. Isn't this what drives the process? I just do not
> understand what has changed that forces the server to send the client
> the version of a file that the client already knows it has.

My best recent explanation I can offer was in my reply about 5 messages
ago, at 10:36 UTC, with subsections "Restore is an aberration" and
"Merge". Not sure what exactly you're missing; maybe not seeing the
overall difference between the two approaches that are currently being compared?

- Julian


Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Mark Phippard <ma...@gmail.com>.
On Fri, Mar 11, 2022 at 10:24 AM Evgeny Kotkov
<ev...@visualsvn.com> wrote:
>
> Julian Foad <ju...@apache.org> writes:
>
> > Conclusions:
> > ------------
> >
> > It is certainly possible that we could modify "update" and the other
> > "online" operations, at least, and the previously "offline" operations
> > too if we want, to make them fetch pristines at point-of-use in this way.
> >
> > Such modifications are not trivial. There is the need to run additional
> > RA requests in between the existing ones, perhaps needing an additional
> > RA session to be established in parallel, or taking care with inserting
> > RA requests into an existing session.
>
> I think that this part has a lot of protocol constraints and hidden complexity.
> And things could probably get even more complex for merge and diff.
>
> Consider a bulk update report over HTTP, which is just a single response that
> has to be consumed in a streamy fashion.

Feel free to ignore this question because it is not going to help us
get to a solution, but there is something about all of this I do not
understand.

Today, in normal SVN, if I do svn update my assumption is the client
and server negotiate a fairly efficient reply. The server only sends
the client the data it needs to update.  Maybe I am wrong, but I do
not feel like I am.

This is where the question comes in ... why does not having the
pristines change this? The WC still knows what files it has and what
revisions. Isn't this what drives the process? I just do not
understand what has changed that forces the server to send the client
the version of a file that the client already knows it has.

Mark

Re: Issue #525/#4892: on only fetching the pristines we really need

Posted by Evgeny Kotkov <ev...@visualsvn.com>.
Julian Foad <ju...@apache.org> writes:

> Conclusions:
> ------------
>
> It is certainly possible that we could modify "update" and the other
> "online" operations, at least, and the previously "offline" operations
> too if we want, to make them fetch pristines at point-of-use in this way.
>
> Such modifications are not trivial. There is the need to run additional
> RA requests in between the existing ones, perhaps needing an additional
> RA session to be established in parallel, or taking care with inserting
> RA requests into an existing session.

I think that this part has a lot of protocol constraints and hidden complexity.
And things could probably get even more complex for merge and diff.

Consider a bulk update report over HTTP, which is just a single response that
has to be consumed in a streamy fashion.  There is no request multiplexing,
and fetching data through a separate connection is going to limit the maximum
size of a pristine file that can be downloaded without receiving a timeout on
the original connection. Assuming the default HTTP timeout of httpd 2.4.x
(60 seconds) and 100 MB/s data transfer rate, the limit for a pristine size
is going to be around 6 GB.
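
The arithmetic behind that figure, as a small sketch. The function name
is made up for the example; the inputs are the 60-second timeout and
100 MB/s rate assumed above, with decimal units (1 MB = 10^6 bytes).

```python
MB = 10**6
GB = 10**9


def max_pristine_bytes(timeout_s: int, rate_bytes_per_s: int) -> int:
    """Largest file that can be fetched on a second connection before
    the idle original connection hits its timeout."""
    return timeout_s * rate_bytes_per_s


if __name__ == "__main__":
    limit = max_pristine_bytes(60, 100 * MB)
    print(limit / GB)  # 6.0
```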

This kind of problem probably isn't limited to just this specific example and
protocol, considering, for instance, that an update editor driver transfers
control at certain points (e.g., during merge) and thus cannot keep reading
the response.

When I was working on the proof-of-concept, encountering these issues stopped
me from considering the approach with fetching pristines at the point of access
as being practically feasible.  That also resulted in the alternative approach,
initially implemented on the `pristines-on-demand` branch.

Going slightly off-topic, I tend to think that even despite its drawbacks,
the current approach on the branch should work reasonably well for the MVP.

To elaborate on that, let me first share a few assumptions and thoughts I had
in mind at that point in time:

1) Let's assume a high-throughput connection to the server, 1 Gbps or better.

   With slower networks, working with large files is going to be problematic
   just by itself, so that might be thought of as being out of scope.

2) Let's assume that a working copy contains a large number of blob-like files.

   In other words, there are thousands of 10-100 MB files, as opposed to one
   or two 50 GB files.

3) Let's assume that in a common case, only a small fraction of these files
   are modified in the working copy.

Then for a working copy with 1,000 files of 100 MB each, 10 of them modified:

A) Every checkout saves 100 GB of disk space; that's pretty significant for
   a typical solid state drive.

B) Hydrating won't transfer more than 1 GB of data, or 10 seconds under an
   optimistic assumption.

C) For a more uncommon case with 100 modified files, it's going to result in
   10 GB of data transferred and about two minutes of time; I think that's
   still pretty reasonable for an uncommon case.
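
These estimates can be reproduced with a few lines. This is a
back-of-the-envelope sketch only: the function name is invented, units
are decimal, and it assumes the 1 Gbps link from assumption 1 runs at
full line rate with no protocol overhead, so the times it gives (8 and
80 seconds) come out slightly below the rounder figures quoted above.

```python
MB = 10**6
GB = 10**9
GBPS = 10**9  # bits per second


def hydration_estimate(n_files, file_bytes, n_modified, link_bits_per_s):
    """Return (disk saved at checkout, bytes to hydrate, seconds)."""
    disk_saved = n_files * file_bytes      # pristines not stored
    transfer = n_modified * file_bytes     # pristines fetched later
    seconds = transfer * 8 / link_bits_per_s
    return disk_saved, transfer, seconds


if __name__ == "__main__":
    for modified in (10, 100):
        saved, transfer, secs = hydration_estimate(
            1000, 100 * MB, modified, 1 * GBPS)
        print(modified, saved / GB, transfer / GB, secs)
```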

So while the approach used in the proof-of-concept might be non-ideal, I tend
to think it should work reasonably well in a variety of use cases.

I also think that this approach should even be releasable in the form of an
MVP, accompanied with a UI option to select between the two states during
checkout (all pristines / pristines-on-demand) that is persisted in the
working copy.

A small final note is that I could be missing some details or other cases,
but in the meantime I felt like sharing the thoughts I had while working on
the proof-of-concept.


Thanks,
Evgeny Kotkov