Posted to dev@subversion.apache.org by Greg Hudson <gh...@mit.edu> on 2001/12/14 00:44:48 UTC

Newlines, preserving data, and multiple access paths

I have some new thoughts on newline translation, after talking to some
MIT friends about it.

1. We can avoid irrevocably destroying data if we make sure all
   newline translations we do are reversible.  A newline translation
   is reversible if there are no CRs or LFs in the file which aren't
   source-format newlines.

   This means we can go back to Ben's proposal, and as long as we add
   this safety, we don't have to worry about destroying anyone's
   engine designs.  If the engine design was made on Windows and
   happens to only contain CRLFs, they will get translated to LF on
   checkin, but translating LF back to CRLF will restore the file.  If
   the engine design contains CRLFs mixed with LFs and CRs, we can
   error out, or decide that the file must be binary after all.

   (If we want to go a little overboard on safety, we could make the
   client library set a property on each commit saying what newline
   translation was done, if any.  Then it would be easy to retrieve
   the exact contents of the committed file by reversing the
   translation.  I don't think this is necessary, though.)

2. Unfortunately, as I noted in one of my many other messages today,
   *none* of the schemes presented so far will robustly handle tools
   which access the repository through DAV or libsvn_fs, if the tools
   run on varying platforms and aren't forgiving about newlines.  In
   order to do that, we have to actually add the concept of a text
   file to the FS layer.

Here is what I propose:

  * For now, we implement Ben's scheme, with the proviso that we never
    do a non-reversible newline translation.  (This totally messes up
    Karl's poll because it didn't include Ben's scheme.)  The
    repository gets a global format of LF.

  * Tools which use DAV or libsvn_fs must be able to handle LF line
    separators.  All Unix tools will be okay.  Most Windows tools will
    probably also be okay because they know they're getting data over
    the net where not everyone uses the same newline style.  (And most
    Mac tools will probably be okay because MacOS X is already
    schizophrenic about newlines.)

  * If the above turns out to be a problem, we can talk about changing
    the concept of the FS layer.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Newlines, preserving data, and multiple access paths

Posted by Bruce Atherton <br...@callenish.com>.

At 11:36 AM 12/14/2001 -0600, Ben Collins-Sussman wrote:
>Bruce's system seems a tad more complicated to implement, since it
>seems to require some kind of auto-detection of EOL style when a
>text-base is first received from the server.

As I mentioned in part two of my proposal, you could choose to store a 
second property on a file that indicated what line ending it has in the 
repository, but that would still be a property that only the client would 
use. It isn't required, but it may be more efficient.

>   And it also needs to
>'remember' that a transform happened previously; either that, or
>re-run the detection heuristic on text-base each time the working file
>is committed.

I was thinking of "remember", perhaps in the entries file? Except that 
integrates it a little more into the client than may be desirable. In the 
abstract, I was thinking more like a set of transforms (line endings, 
keywords, whatever) that could be plugged in to a client or not depending 
on user preference, and that would provide perhaps three callbacks 
(transform_stream, reverse_transform_stream, requires_transform). In the 
concrete, of course, that probably all goes out the window.
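
A minimal sketch of what such a pluggable transform might look like, in Python. The three method names come from the paragraph above; everything else (the `Transform` protocol, the `EolTransform` class, the `eol-conversion` property name, and the `apply_transforms` helper) is hypothetical and only illustrates the shape of the idea. It reads whole streams into memory for brevity and omits any safety check.

```python
import io
from typing import BinaryIO, Protocol, Sequence


class Transform(Protocol):
    """The three callbacks named above; signatures are an assumption."""

    def requires_transform(self, props: dict) -> bool: ...
    def transform_stream(self, src: BinaryIO, dst: BinaryIO) -> None: ...
    def reverse_transform_stream(self, src: BinaryIO, dst: BinaryIO) -> None: ...


class EolTransform:
    """Hypothetical line-ending plug-in: repository LF <-> native newline."""

    def __init__(self, native: bytes = b"\r\n") -> None:
        self.native = native

    def requires_transform(self, props: dict) -> bool:
        # "eol-conversion" is an invented property name for this sketch.
        return props.get("eol-conversion") == "on"

    def transform_stream(self, src: BinaryIO, dst: BinaryIO) -> None:
        # Checkout direction: LF -> native.
        dst.write(src.read().replace(b"\n", self.native))

    def reverse_transform_stream(self, src: BinaryIO, dst: BinaryIO) -> None:
        # Commit direction: native -> LF.
        dst.write(src.read().replace(self.native, b"\n"))


def apply_transforms(transforms: Sequence[Transform], props: dict,
                     data: bytes, reverse: bool = False) -> bytes:
    """Run every transform that applies; undo them in reverse order."""
    chain = [t for t in transforms if t.requires_transform(props)]
    if reverse:
        chain.reverse()
    for t in chain:
        src, dst = io.BytesIO(data), io.BytesIO()
        if reverse:
            t.reverse_transform_stream(src, dst)
        else:
            t.transform_stream(src, dst)
        data = dst.getvalue()
    return data
```

A client that does not want a given transform simply leaves it out of the list, which is the "plugged in or not" part of the idea.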

>Please correct me if I'm wrong.  My brain is spinning, and I'm so
>tired of reading/thinking about this issue.

Me too. I'd given up on posting anything more on the topic, but thought 
these clarifications might be helpful.


Re: Newlines, preserving data, and multiple access paths

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2001-12-14 at 15:48, William Uther wrote:
> --On Friday, 14 December 2001 1:16 PM -0500 Greg Hudson <gh...@MIT.EDU> 
> wrote:

> >   If newline-style is LF, CR, or CRLF, translate <native newline style>
> > -> <requested newline style>.  If we notice any CRs or LFs which aren't
> > part of a native-style newline and aren't part of a requested-style
> > newline, abort the commit.  If the commit succeeds, apply the <native
> > newline style> -> <requested newline style> translation to the working
> > copy as well, so that it matches what we would get from a checkout of
> > the new rev.
> 
> I don't think this preserves reversibility.  If a file contains BOTH 
> <native-style newline> and <requested-style newline> then you need to 
> abort.  If you translate just <native-style newline> then you can't undo 
> the transformation - you don't know which newlines need to be untransformed.

This particular transform (for files marked CRLF, CR, or LF) is not
reversible.  See where I said:

  We probably don't have to worry so much about data safety for
  these files since a particular, odd behavior has been specified for
  them.

However, let's add a possible variation to my proposal, for those who
are still uncomfortable with data-destroying transformations applied to
such files:

  Variation 5: If the file is marked CRLF, CR, or LF, we translate
<native-style newline> to <requested-style newline> during commit, and
abort the commit if we notice any kind of mixing of newline styles.
(Can also combine with variation 1.)



Re: Newlines, preserving data, and multiple access paths

Posted by Colin Putney <cp...@whistler.com>.

William Uther wrote:

> --On Friday, 14 December 2001 1:16 PM -0500 Greg Hudson 
> <gh...@MIT.EDU> wrote:
>
>>   If newline-style is LF, CR, or CRLF, translate <native newline style>
>> -> <requested newline style>.  If we notice any CRs or LFs which aren't
>> part of a native-style newline and aren't part of a requested-style
>> newline, abort the commit.  If the commit succeeds, apply the <native
>> newline style> -> <requested newline style> translation to the working
>> copy as well, so that it matches what we would get from a checkout of
>> the new rev.
>
> I don't think this preserves reversibility.  If a file contains BOTH 
> <native-style newline> and <requested-style newline> then you need to 
> abort.  If you translate just <native-style newline> then you can't 
> undo the transformation - you don't know which newlines need to be 
> untransformed.
>
> Stated simply: You should only translate when the newline style is 
> entirely consistent.  Anything else removes the inconsistency and hence 
> loses information.

True, this scheme doesn't preserve reversibility. But in this case 
that's OK, because the newline-style decrees what the newline style must 
be. If there are native-style newlines mixed in with the requested-style 
newlines, this is probably the result of corruption by some 
native-newline-obsessive user tool. So the non-reversible transform will 
actually undo the corruption.

For example, take the file foo.dsp, which has a newline-style of CRLF. It's 
stored in the repository with CRLF newlines and on checkout, no 
transformation is done. If Linus checks out the file and edits it in an 
old version of emacs, any lines he adds will be terminated with a bare 
LF. Since this is his native style of newline, the transformation Greg 
described will undo this damage.

If the newline-style is set to a specific newline-style (i.e., CR, LF, or 
CRLF), then we know that (1) the file is text, not binary, and (2), any 
other style of newline present is corruption.

A file should not be marked with a specific newline style unless (1) 
the user does so explicitly, or (2) it matches some heuristic when it's 
added, *and* the file contents conform to that newline style.

So the only real possibility for corruption is if some user tool creates 
a binary file that matches a heuristic for a specific newline style. In 
our running example, William creates a vector graphics file called 
foo.dsp and adds it. By chance, this file happens to have CRLFs 
scattered through it, but no bare CRs, LFs, '\0' characters or other 
harbingers of binary files. On the commit, svn will notice the 
extension, set the newline-style to CRLF and send it to the repository. 
William may get an error if he tries to commit a change that introduces 
a bare CR or LF, but he won't corrupt the file.

Linus can corrupt the file if he makes a change that introduces a bare 
LF, which will get transformed into CRLF on commit. Alternatively, 
Madeleine (was that her name?) can introduce a bare CR and commit, which 
will also corrupt the file.

That's a pretty long string of unlikely coincidences though, while the 
opposite case, where this transformation *fixes* corruption, is quite 
common.

Colin


Re: Newlines, preserving data, and multiple access paths

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

William Uther <wi...@cs.cmu.edu> writes:
> >   If newline-style is LF, CR, or CRLF, translate <native newline style>
> > -> <requested newline style>.  If we notice any CRs or LFs which aren't
> > part of a native-style newline and aren't part of a requested-style
> > newline, abort the commit.  If the commit succeeds, apply the <native
> > newline style> -> <requested newline style> translation to the working
> > copy as well, so that it matches what we would get from a checkout of
> > the new rev.
> 
> I don't think this preserves reversibility.  If a file contains BOTH
> <native-style newline> and <requested-style newline> then you need to
> abort.  If you translate just <native-style newline> then you can't
> undo the transformation - you don't know which newlines need to be
> untransformed.
> 
> Stated simply: You should only translate when the newline style is
> entirely consistent.  Anything else removes the inconsistency and
> hence loses information.

I think that's what Greg H is saying; he just said it differently.  He
didn't mean to tolerate files with mixed line endings, just that if
the modified file happens to *match* the specified ending (i.e., any
conversion that took place was entirely undone by some user tool),
that shouldn't be cause for aborting the commit.

-Karl


Re: Newlines, preserving data, and multiple access paths

Posted by William Uther <wi...@cs.cmu.edu>.

--On Friday, 14 December 2001 1:16 PM -0500 Greg Hudson <gh...@MIT.EDU> 
wrote:

>   If newline-style is LF, CR, or CRLF, translate <native newline style>
> -> <requested newline style>.  If we notice any CRs or LFs which aren't
> part of a native-style newline and aren't part of a requested-style
> newline, abort the commit.  If the commit succeeds, apply the <native
> newline style> -> <requested newline style> translation to the working
> copy as well, so that it matches what we would get from a checkout of
> the new rev.

I don't think this preserves reversibility.  If a file contains BOTH 
<native-style newline> and <requested-style newline> then you need to 
abort.  If you translate just <native-style newline> then you can't undo 
the transformation - you don't know which newlines need to be untransformed.

Stated simply: You should only translate when the newline style is entirely 
consistent.  Anything else removes the inconsistency and hence loses 
information.
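
The information loss can be shown with a short Python sketch. It assumes a Unix client (native LF) and a requested style of CRLF; the `translate_native` helper is hypothetical and treats CRLF as a single newline token. Two different working files come out identical, so no reverse transform can tell them apart.

```python
import re

# Longest alternative first, so a CRLF pair is matched as one token.
NEWLINE = re.compile(rb"\r\n|\r|\n")


def translate_native(data: bytes, native: bytes, requested: bytes) -> bytes:
    """Rewrite only native-style newlines; leave every other newline alone."""
    return NEWLINE.sub(
        lambda m: requested if m.group() == native else m.group(), data
    )


native, requested = b"\n", b"\r\n"
consistent = b"one\ntwo\n"     # entirely native style
mixed = b"one\r\ntwo\n"        # requested and native styles mixed

# Both inputs collapse to the same bytes, so the inconsistency is gone
# and cannot be restored afterwards.
assert translate_native(consistent, native, requested) == b"one\r\ntwo\r\n"
assert translate_native(mixed, native, requested) == b"one\r\ntwo\r\n"
```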

later,

\x/ill         :-}


Re: Newlines, preserving data, and multiple access paths

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Greg Hudson <gh...@MIT.EDU> writes:
> No... it just means that if the mod times force a contents check, you
> have to translate the text-base contents as you compare them against the
> normal contents.  That's "a teeny tiny bit slower," not, "a lot slower."

It's worse than that -- it invalidates our size check.  Right now,
differing file sizes *guarantee* that a modification was made.  With
eol conversion, there can be different sizes with no local mods.

So we'd lose one early return from text_modified_p().
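
A simplified in-memory model of the problem, in Python. This is not the real text_modified_p(), which works on files and timestamps; the `text_modified` function and its `translated` flag are assumptions made for illustration. It assumes the text-base is a verbatim LF copy of the repository contents and the working file uses the native newline.

```python
def text_modified(working: bytes, text_base: bytes,
                  native: bytes, translated: bool) -> bool:
    """Report whether the working file differs from the text-base."""
    if not translated:
        # No EOL conversion: a size mismatch proves a modification,
        # so we can return early without reading the contents.
        if len(working) != len(text_base):
            return True
        return working != text_base
    # With EOL conversion the sizes can differ with no local mods, so
    # the size shortcut is gone; compare against the translated base.
    return working != text_base.replace(b"\n", native)


base = b"a\nb\n"          # as stored in the repository
wc = b"a\r\nb\r\n"        # same text on a CRLF platform
assert len(base) != len(wc)
assert not text_modified(wc, base, b"\r\n", translated=True)
```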


Re: Newlines, preserving data, and multiple access paths

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2001-12-14 at 17:44, Karl Fogel wrote:
> +1 on Greg Hudson's latest proposal -- and I think we're now ready to
> Actually Do It. :-)

I hope so.  For a while I was afraid we had hit our first failure to
achieve livable consensus.  My apologies for not realizing the
reversibility thing until two days and several thousand lines of
misguided debate had already gone by.

> My assumption is that "in the working copy" means both text-base and
> working file, for the sake of an efficient is-modified-p test, and
> since the repository file is just an automatic transform off the
> text-base anyway.

Actually, I was assuming that text-base would be a verbatim copy of the
repository contents.  But that's kind of an implementation detail; let's
leave that up to Ben (assuming he's doing the implementation).

> Otherwise, then the is-modified-p check has to be tweaked in a way
> that will make modifiedness checks a lot slower in some cases.

No... it just means that if the mod times force a contents check, you
have to translate the text-base contents as you compare them against the
normal contents.  That's "a teeny tiny bit slower," not, "a lot slower."

> The second sentence of the above paragraph isn't about allowing
> mixed-style files.  It's saying that if the entire file is native
> format, allow that (and transform when necessary), OR if the entire
> file is in the requested style, then allow that too.  The latter
> situation could happen if someone used a LF-style tool under Windows,
> for example, so that when an LF-style file got saved, the whole thing
> would be LF-style now, not native style.  No reason to disallow this.
> 
> Right?

See my last message, as well as Colin Putney's argument.  In summary,
that's not actually what I meant, but I don't really care either way.


Re: Newlines, preserving data, and multiple access paths

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

+1 on Greg Hudson's latest proposal -- and I think we're now ready to
Actually Do It. :-)

It's clear that no solution is perfect, because doing eol conversion
raises some inherently unresolvable questions.  But my sense from
recent discussions is that everyone here can live with the choices
this proposal makes; also, it follows the Principle Of Least Surprise,
and is highly unlikely to damage data unfixably.

Below I'll quote his proposal, with a few annotations reflecting my
understanding of certain points, just to make sure.

If you have a *violent* objection, please post; otherwise, please do
not.  We're looking for liveable consensus now, not further
refinements that would help some border cases and harm others. :-)

> Alright, I'll make a proposal which is like yours but (in my opinion) a
> little clearer.  First, let's look at the different use cases:
> 
>   1. The most common case--text files which want native line endings. 
> These should be stored in the repository using LF line endings, and in
> the working dir using native line endings.
> 
>   2. Binary files.  These files we don't want to touch at all.
> 
>   3. Text files which, for one reason or another, want a specific line
> ending format regardless of platform.  These should be stored in the
> repository and in the working directory using the specified line
> ending.  We probably don't have to worry so much about data safety for
> these files since a particular, odd behavior has been specified for
> them.
> 
> There are, of course, a hundred different ways we could arrange the
> metadata.  I propose an "svn:newline-style" property with the possible
> values "none", "native", "LF", "CR", and "CRLF".  The values mean:
> 
>   none: Use case 2.  Don't do any newline translation.
> 
>   native: Use case 1.  Store with LF in repository, and with native line
> endings in the working copy.

My assumption is that "in the working copy" means both text-base and
working file, for the sake of an efficient is-modified-p test, and
since the repository file is just an automatic transform off the
text-base anyway.

If that's what you meant, then incoming svndiff has to be applied to a
deconverted tmp file, which then becomes the new text-base.  No
problem.

Otherwise, then the is-modified-p check has to be tweaked in a way
that will make modifiedness checks a lot slower in some cases.

>   LF, CR, CRLF: Use case 3.  Store with specified format in the
> repository and in the working copy.
> 
> On commit, we apply the following rules to transform the data committed
> to the server:
> 
>   If newline-style is none, do nothing.
> 
>   If newline-style is native, translate <native newline style> -> LF.  If
> we notice any CRs or LFs which aren't part of a native-style newline,
> abort the commit.
> 
>   If newline-style is LF, CR, or CRLF, translate <native newline style>
> -> <requested newline style>.  If we notice any CRs or LFs which aren't
> part of a native-style newline and aren't part of a requested-style
> newline, abort the commit.  If the commit succeeds, apply the <native
> newline style> -> <requested newline style> translation to the working
> copy as well, so that it matches what we would get from a checkout of
> the new rev.

The second sentence of the above paragraph isn't about allowing
mixed-style files.  It's saying that if the entire file is native
format, allow that (and transform when necessary), OR if the entire
file is in the requested style, then allow that too.  The latter
situation could happen if someone used a LF-style tool under Windows,
for example, so that when an LF-style file got saved, the whole thing
would be LF-style now, not native style.  No reason to disallow this.

Right?

> On checkout, we translate LF -> <native newline style> if newline-style
> is native; otherwise, we leave the file alone.

Yup.

> For now, let's say the default value of svn:newline-style is none.  In
> the future, we'll want to think about things like how to enable
> newline-translation over the whole repository except for files which
> don't appear to be text.

Agree.  Let's wait and let real-life use cases drive how we do mass
enablings.

I don't see any need for any of the "Variations" right now.  Let's see
how the above works first.

-Karl

> I think that's a complete proposal.  Some possible variations:
> 
>   Variation 1: If newline-style is native, on commit, translate <first
> newline style seen> -> LF.  If we see any CRs or LFs which don't match
> the first newline style seen, abort the commit.
> 
>   Variation 2: If newline-style is native, before commit, examine the
> file to see if it uses only the native newline style.  If it doesn't,
> set the newline-style property to "none" and commit with no translation.
> 
>   Variation 3: Combine variations 1 and 2; if newline-style is native,
> then before commit, examine the file to see if it uses a single
> consistent newline style.  If it does, translate <that newline style> ->
> LF; if not, commit with newline-style set to "none" and no translation.
> 
>   Variation 4: If newline-style is native, then on commit, we edit a
> property "svn:newline-conversion" to something like "CRLF LF" to show
> what conversion we did.  This enables mechanical reversal of the
> translation if the file is later determined to be binary.  (Particularly
> useful with variations 1 or 3 where the transform might not be obvious
> from the platform where the file was checked in.)


Re: Newlines, preserving data, and multiple access paths

Posted by Greg Hudson <gh...@MIT.EDU>.

On Fri, 2001-12-14 at 12:36, Ben Collins-Sussman wrote:
> Sorry, what is a 'source-format newline'?

I guess I played a little fast and loose with terminology there.

Let's say we're transforming CRLF to LF.  CRLF is what I meant by
"source-format newline".  If there are any CRs or LFs in the file which
aren't part of a CRLF pair, then the transform is not reversible.
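
As a rough Python sketch of that test (the function names here are made up for illustration): strip out every source-format newline, and if any CR or LF is left over, the transform cannot be undone.

```python
def is_reversible(data: bytes, source_newline: bytes) -> bool:
    """True if <source_newline> -> LF can be undone exactly."""
    # Remove every source-format newline; any CR or LF that remains
    # is a stray that the reverse transform could not reconstruct.
    remainder = data.replace(source_newline, b"")
    return b"\r" not in remainder and b"\n" not in remainder


def to_repository(data: bytes, source_newline: bytes) -> bytes:
    """Translate to LF, refusing any non-reversible translation."""
    if not is_reversible(data, source_newline):
        raise ValueError("stray CR or LF: translation is not reversible")
    return data.replace(source_newline, b"\n")


def from_repository(data: bytes, source_newline: bytes) -> bytes:
    """Reverse transform: LF back to the source-format newline."""
    return data.replace(b"\n", source_newline)


original = b"line one\r\nline two\r\n"
stored = to_repository(original, b"\r\n")
assert from_repository(stored, b"\r\n") == original
```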

> So, given that we're implementing a transform-on-commit system, the
> only clarification left is how metadata fits in.

Alright, I'll make a proposal which is like yours but (in my opinion) a
little clearer.  First, let's look at the different use cases:

  1. The most common case--text files which want native line endings. 
These should be stored in the repository using LF line endings, and in
the working dir using native line endings.

  2. Binary files.  These files we don't want to touch at all.

  3. Text files which, for one reason or another, want a specific line
ending format regardless of platform.  These should be stored in the
repository and in the working directory using the specified line
ending.  We probably don't have to worry so much about data safety for
these files since a particular, odd behavior has been specified for
them.

There are, of course, a hundred different ways we could arrange the
metadata.  I propose an "svn:newline-style" property with the possible
values "none", "native", "LF", "CR", and "CRLF".  The values mean:

  none: Use case 2.  Don't do any newline translation.

  native: Use case 1.  Store with LF in repository, and with native line
endings in the working copy.

  LF, CR, CRLF: Use case 3.  Store with specified format in the
repository and in the working copy.

On commit, we apply the following rules to transform the data committed
to the server:

  If newline-style is none, do nothing.

  If newline-stle is native, translate <native newline style> -> LF.  If
we notice any CRs or LFs which aren't part of a native-style newline,
abort the commit.

  If newline-style is LF, CR, or CRLF, translate <native newline style>
-> <requested newline style>.  If we notice any CRs or LFs which aren't
part of a native-style newline and aren't part of a requested-style
newline, abort the commit.  If the commit succeeds, apply the <native
newline style> -> <requested newline style> translation to the working
copy as well, so that it matches what we would get from a checkout of
the new rev.

On checkout, we translate LF -> <native newline style> if newline-style
is native; otherwise, we leave the file alone.
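
The commit and checkout rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not Subversion code: `on_commit`, `on_checkout`, and `CommitAborted` are hypothetical names, `native` is the platform newline passed in by the caller, and a CRLF pair is always read as one newline token (so with native CR and requested LF, a CRLF pair counts as neither and aborts).

```python
import re

# Longest alternative first, so a CRLF pair is matched as one token.
NEWLINE = re.compile(rb"\r\n|\r|\n")
STYLES = {"LF": b"\n", "CR": b"\r", "CRLF": b"\r\n"}


class CommitAborted(Exception):
    """Raised when a newline outside the allowed styles is found."""


def on_commit(data: bytes, style: str, native: bytes) -> bytes:
    """Return the bytes to send to the server for this newline-style."""
    if style == "none":
        return data
    if style == "native":
        target, allowed = b"\n", {native}
    else:
        target = STYLES[style]
        allowed = {native, target}

    def fix(match: "re.Match[bytes]") -> bytes:
        token = match.group()
        if token not in allowed:
            raise CommitAborted(f"unexpected newline {token!r}")
        return target

    return NEWLINE.sub(fix, data)


def on_checkout(data: bytes, style: str, native: bytes) -> bytes:
    """Translate LF -> native only when newline-style is native."""
    if style == "native":
        return data.replace(b"\n", native)
    return data
```

For the LF, CR, and CRLF styles, the value `on_commit` returns is also what the working file would be rewritten to after a successful commit, so that it matches a fresh checkout.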

For now, let's say the default value of svn:newline-style is none.  In
the future, we'll want to think about things like how to enable
newline-translation over the whole repository except for files which
don't appear to be text.

I think that's a complete proposal.  Some possible variations:

  Variation 1: If newline-style is native, on commit, translate <first
newline style seen> -> LF.  If we see any CRs or LFs which don't match
the first newline style seen, abort the commit.

  Variation 2: If newline-style is native, before commit, examine the
file to see if it uses only the native newline style.  If it doesn't,
set the newline-style property to "none" and commit with no translation.

  Variation 3: Combine variations 1 and 2; if newline-style is native,
then before commit, examine the file to see if it uses a single
consistent newline style.  If it does, translate <that newline style> ->
LF; if not, commit with newline-style set to "none" and no translation.

  Variation 4: If newline-style is native, then on commit, we edit a
property "svn:newline-conversion" to something like "CRLF LF" to show
what conversion we did.  This enables mechanical reversal of the
translation if the file is later determined to be binary.  (Particularly
useful with variations 1 or 3 where the transform might not be obvious
from the platform where the file was checked in.)
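
A possible sketch of Variation 3 combined with Variation 4, in Python. The property names are the ones proposed in this message; the functions `commit_variation3` and `undo_conversion` are hypothetical, and properties are modelled as a plain dict.

```python
import re

# Longest alternative first, so a CRLF pair is matched as one token.
NEWLINE = re.compile(rb"\r\n|\r|\n")
NAMES = {b"\n": "LF", b"\r": "CR", b"\r\n": "CRLF"}
BYTES = {name: raw for raw, name in NAMES.items()}


def commit_variation3(data: bytes) -> "tuple[bytes, dict]":
    """Translate a single consistent style to LF, else commit untouched."""
    styles = {m.group() for m in NEWLINE.finditer(data)}
    if len(styles) > 1:
        # Mixed newlines: give up on translation and treat as binary.
        return data, {"svn:newline-style": "none"}
    props = {"svn:newline-style": "native"}
    if styles:
        source = styles.pop()
        data = data.replace(source, b"\n")
        # Variation 4: record what was done so it can be reversed.
        props["svn:newline-conversion"] = f"{NAMES[source]} LF"
    return data, props


def undo_conversion(data: bytes, props: dict) -> bytes:
    """Mechanically reverse the recorded conversion, if there was one."""
    conversion = props.get("svn:newline-conversion")
    if conversion is None:
        return data
    source_name, target_name = conversion.split()
    return data.replace(BYTES[target_name], BYTES[source_name])
```

Because the conversion is only recorded when the file used one consistent style, replacing every LF on the way back restores the committed bytes exactly.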


Re: Newlines, preserving data, and multiple access paths

Posted by Ben Collins-Sussman <su...@collab.net>.

Greg Hudson <gh...@mit.edu> writes:

> 1. We can avoid irrevocably destroying data if we make sure all
>    newline translations we do are reversible.  A newline translation
>    is reversible if there are no CRs or LFs in the file which aren't
>    source-format newlines.

Sorry, what is a 'source-format newline'?

> Here is what I propose:
> 
>   * For now, we implement Ben's scheme, with the proviso that we never
>     do a non-reversible newline translation.  (This totally messes up
>     Karl's poll because it didn't include Ben's scheme.)  The
>     repository gets a global format of LF.

OK, so you're advocating (like everyone else now) that it's okay to do
a 'reverse transform' when committing, provided our transforms are
Safe.  That's a great turn of events!  This was the huge Sticking
Point that differentiated your proposal from mine & Bruce's.  I feel
like a major hurdle has been crossed.

So, given that we're implementing a transform-on-commit system, the
only clarification left is how metadata fits in.  My & Bruce's systems
had slightly different notions of how metadata should work in
determining system behavior.

  * In my system, an EOL property defined how a file should look in
    the repository.  The client was responsible for making sure that
    this style was always committed to the repository.  If this
    property was non-existent, the client assumes it has a value of
    'LF'.  Then there was a -second- property that enabled one to
    switch EOL conversion on/off per file.  The absence of this second
    property can imply EOL is either on or off by default; I don't
    care which.

  * In Bruce's system, he had only one property - namely, the on/off
    switch.  If the property was 'on', then a committed file would be
    reverse-transformed on commit, assuming that a transform had
    originally happened on checkout.

Bruce's system seems a tad more complicated to implement, since it
seems to require some kind of auto-detection of EOL style when a
text-base is first received from the server.  And it also needs to
'remember' that a transform happened previously; either that, or
re-run the detection heuristic on text-base each time the working file
is committed.

Please correct me if I'm wrong.  My brain is spinning, and I'm so
tired of reading/thinking about this issue.  I just want to code
already.  :-)


Re: Newlines, preserving data, and multiple access paths

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

This is a real issue, but we have enough to do already.  Let's please
put this kind of thing off till post-1.0.

That'll be a wonderful day, the day we have problems because so many
non-Subversion tools are accessing Subversion repositories.  That day
is not today. :-)

-K


Mark Benedetto King <bk...@answerfriend.com> writes:
> > 2. Unfortunately, as I noted in one of my many other messages today,
> >    *none* of the schemes presented so far will robustly handle tools
> >    which access the repository through DAV or libsvn_fs, if the tools
> >    run on varying platforms and aren't forgiving about newlines.  In
> >    order to do that, we have to actually add the concept of a text
> >    file to the FS layer.
> > 
> 
> I proposed a solution on IRC to handle this case.  It seems to me that
> what we want here is something like a "view", i.e., a WC-specific set
> of properties.  What if we embed the WC-desired CR/NL/CRNL semantics
> *in* the request URL?
> 
> 	http://svn.collab.net/CR/repos/svn/trunk
> 	http://svn.collab.net/NL/repos/svn/trunk
> 	http://svn.collab.net/CRNL/repos/svn/trunk
> 
> And then let an apache module sort out the rewriting?
> 
> Alternatively, we could do:
> 
> 	http://svn.collab.net/repos/svn/trunk?record=CR
> 	http://svn.collab.net/repos/svn/trunk?record=NL
> 	http://svn.collab.net/repos/svn/trunk?record=CRNL
> 
> And let mod_dav sort out the rewriting. I'm not sure
> if all DAV tools can include a query-string, though.
> 
> Another alternative would be to use an SVN branch that
> had alternate default properties:
> 
> 	http://svn.collab.net/repos/CR/svn/trunk
> 	http://svn.collab.net/repos/NL/svn/trunk
> 	http://svn.collab.net/repos/CRNL/svn/trunk
> 
> This would require server-side implementation of the
> separator semantics (which goes against the current 
> proposals, but does clean up this mess).  Also, these
> branches would probably need to be read-only.
> 
> 
> Comments?
> 
> --ben
> 
> 
> 


Re: Newlines, preserving data, and multiple access paths

Posted by Mark Benedetto King <bk...@answerfriend.com>.

On Thu, Dec 13, 2001 at 07:44:48PM -0500, Greg Hudson wrote:
> I have some new thoughts on newline translation, after talking to some
> MIT friends about it.
> 
> 1. We can avoid irrevocably destroying data if we make sure all
>    newline translations we do are reversible.  A newline translation
>    is reversible if there are no CRs or LFs in the file which aren't
>    source-format newlines.
> 
>    This means we can go back to Ben's proposal, and as long as we add
>    this safety, we don't have to worry about destroying anyone's
>    engine designs.  If the engine design was made on Windows and
>    happens to only contain CRLFs, they will get translated to LF on
>    checkin, but translating LF back to CRLF will restore the file.  If
>    the engine design contains CRLFs mixed with LFs and CRs, we can
>    error out, or decide that the file must be binary after all.
> 
>    (If we want to go a little overboard on safety, we could make the
>    client library set a property on each commit saying what newline
>    translation was done, if any.  Then it would be easy to retrieve
>    the exact contents of the committed file by reversing the
>    translation.  I don't think this is necessary, though.)

I totally agree.  Another way to look at this is that if a file
has mixed separators, it's a binary file.  Of course, that means
an O(n) scan for determination of "binaryness", but maybe we should
do that anyway (the current heuristic only looks at the first bit
of the file).
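
As a rough illustration of that full scan (a sketch only; the function
names are made up for this example and are not Subversion code), the
check could collect the set of terminators a file contains and refuse
to translate anything that mixes them:

```python
import re

# Matches one line terminator; CRLF is tried first so it counts as one unit.
_TERMINATOR = re.compile(rb"\r\n|\r|\n")


def newline_styles(data):
    """Return the set of distinct terminators in data (one O(n) pass)."""
    return set(_TERMINATOR.findall(data))


def looks_like_text(data):
    """Treat a file as text only if it uses at most one terminator style."""
    return len(newline_styles(data)) <= 1


def to_lf(data):
    """Translate to LF, refusing input whose translation is not reversible."""
    styles = newline_styles(data)
    if len(styles) > 1:
        raise ValueError("mixed line terminators; treat as binary")
    # A file with no terminators at all is passed through unchanged.
    (style,) = styles or {b"\n"}
    return data.replace(style, b"\n")
```

Because only a single style is ever accepted, replacing LF with that
style again gives back the original bytes.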

> 
> 2. Unfortunately, as I noted in one of my many other messages today,
>    *none* of the schemes presented so far will robustly handle tools
>    which access the repository through DAV or libsvn_fs, if the tools
>    run on varying platforms and aren't forgiving about newlines.  In
>    order to do that, we have to actually add the concept of a text
>    file to the FS layer.
> 

I proposed a solution on IRC to handle this case.  It seems to me that
what we want here is something like a "view", i.e., a WC-specific set
of properties.  What if we embed the WC-desired CR/NL/CRNL semantics
*in* the request URL?

	http://svn.collab.net/CR/repos/svn/trunk
	http://svn.collab.net/NL/repos/svn/trunk
	http://svn.collab.net/CRNL/repos/svn/trunk

And then let an apache module sort out the rewriting?

Alternatively, we could do:

	http://svn.collab.net/repos/svn/trunk?record=CR
	http://svn.collab.net/repos/svn/trunk?record=NL
	http://svn.collab.net/repos/svn/trunk?record=CRNL

And let mod_dav sort out the rewriting. I'm not sure
if all DAV tools can include a query-string, though.
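
To make the query-string variant concrete, here is a small sketch of
what the rewriting step might look like. It assumes the repository
stores text with LF (as in Greg's proposal) and that a missing
parameter means NL; requested_separator and serve_text are
hypothetical helpers, not mod_dav functions.

```python
from urllib.parse import parse_qs, urlsplit

# Separator names as used in the URLs above, mapped to byte sequences.
SEPARATORS = {"CR": b"\r", "NL": b"\n", "CRNL": b"\r\n"}


def requested_separator(url, default="NL"):
    """Return the separator bytes named by the URL's record= parameter."""
    query = parse_qs(urlsplit(url).query)
    name = query.get("record", [default])[0]
    if name not in SEPARATORS:
        raise ValueError("unknown record separator: %r" % name)
    return SEPARATORS[name]


def serve_text(repo_bytes, url):
    """Rewrite LF-terminated repository text into the requested separator."""
    return repo_bytes.replace(b"\n", requested_separator(url))
```

For example, serve_text(b"abc\ndef\n", ".../trunk?record=CRNL") would
hand a DAV client the CRLF form without the stored bytes changing.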

Another alternative would be to use an SVN branch that
had alternate default properties:

	http://svn.collab.net/repos/CR/svn/trunk
	http://svn.collab.net/repos/NL/svn/trunk
	http://svn.collab.net/repos/CRNL/svn/trunk

This would require server-side implementation of the
separator semantics (which goes against the current 
proposals, but does clean up this mess).  Also, these
branches would probably need to be read-only.

Comments?

--ben

Re: Yet another line-end proposal (YALEP?)

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Philip Martin <ph...@codematters.co.uk> writes:
>  - There is a native-line-end property that can be set on a file. I am
>    not sure if this is a separate property from the text/binary thing
>    as I am not sure what the text/binary thing does at present!

Philip, to answer your implied question:

A file needs to know if it is text vs binary so that the client can
use `diff' and `patch' to merge repository changes into a locally
modified file.  For binary files, the client won't even try to merge,
it just gives you both copies and lets you figure out what to do (we
plan support for pluggable merge tools, but that's not done yet).

Thus, text vs binary would be relevant even if we weren't supporting
newline conversion nor keyword substitution.

-Karl

Yet another line-end proposal (YALEP?)

Posted by Philip Martin <ph...@codematters.co.uk>.

Greg Hudson <gh...@mit.edu> writes:

> 1. We can avoid irrevocably destroying data if we make sure all
>    newline translations we do are reversible.  A newline translation
>    is reversible if there are no CRs or LFs in the file which aren't
>    source-format newlines.

This is the property I have been trying to use.

Background (feel free to skip this para; it explains why I want this,
not how it works): I worked with ClearCase on a C++ project in a mixed
Unix/NT environment earlier this year.  ClearCase does line-ending
conversion on a per-view basis for text files (a view is a working-
copy in subversion terms). We used views that were mounted on both
Unix and NT boxes, i.e. one view was simultaneously mounted on both
machines. This made cross platform development much easier as changes
could be built on both platforms without requiring them to be either
checked-in, or manually copied between views. However it caused real
problems if line-ending conversion was enabled. A view set up to use
NT line-endings, say, was hard to use from a Unix box, since every line
had already changed. Merging, which is normally a ClearCase strong
point, was disrupted if a set of line-end changes got erroneously
committed. We ended up abandoning the line-end conversion, and using
pre-commit triggers to produce the line-endings we wanted (.ds[pw]'s
had CRLF, source code plain LF).  This initially disposes me to
support no line-end conversion, and for the default to be off if it is
present. However given that lots of people want it, take a deep breath
and here goes.

Proposal
========

 Rules:

 - The text-base always duplicates the repository.

 - Any sort of line-ending can appear in any file in the repository.

 - There is a native-line-end property that can be set on a file. I am
   not sure if this is a separate property from the text/binary thing
   as I am not sure what the text/binary thing does at present!

 Rules when native-line-end is not set:

 - If the property is not set, no line-end conversion occurs. The
   working-copy duplicates the repository. Files get committed exactly
   as they appear in the working-copy, just like a straight binary file.

 Rules when native-line-end is set:

 - At check-out/update/revert convert all line-endings in the working
   copy to whatever the platform requires. Store the platform line-end
   property in the .svn/entries file (or wherever) to allow checkout
   with Unix client and check-in with non-Unix client or vice
   versa. The .svn/entries property is "none" or "LF" or "CRLF" etc.,
   i.e. an explicit line-ending and not just "native".

 - At check-out/update/revert there is a -no-convert option to disable
   line-end conversion, overriding the native-line-end property. This
   also changes the line-end property in the .svn/entries file.

 - At commit, check the .svn/entries file to determine the
   line-end property. When generating the diff between the
   working-copy and the text-base, ignore any line-end difference
   that is explained by the line-ending conversion. If the
   introduced line-endings are incompatible with the .svn/entries
   line-end property, display an error.
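
 The check-out rule above might be sketched like this. It is only an
 illustration: check_out, platform_eol_name and the eol_name argument
 are names invented here, and the second return value stands for
 whatever would be written to .svn/entries.

```python
import os
import re

_ANY_EOL = re.compile(rb"\r\n|\r|\n")
EOL_BYTES = {"LF": b"\n", "CRLF": b"\r\n", "CR": b"\r"}


def platform_eol_name():
    """Explicit name for this platform's line ending (never just 'native')."""
    native = os.linesep.encode("ascii")
    for name, seq in EOL_BYTES.items():
        if seq == native:
            return name
    return "LF"


def check_out(text_base, native_line_end, eol_name=None, no_convert=False):
    """Return (working-copy bytes, line-end value to record in entries)."""
    if not native_line_end or no_convert:
        # No conversion: the working copy duplicates the text-base.
        return text_base, "none"
    eol_name = eol_name or platform_eol_name()
    eol = EOL_BYTES[eol_name]
    # Every terminator, whatever its style, becomes the platform's.
    return _ANY_EOL.sub(lambda m: eol, text_base), eol_name
```

 Passing eol_name explicitly just makes it possible to try the CRLF
 case on a Unix box, e.g. check_out(b"abc\ndef\n", True, "CRLF").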

 Diff Algorithm:

 The diff algorithm is basically as follows: do the line-ending
 conversion specified in the .svn/entries file on the text-base to
 generate the pristine working-copy. Diff the pristine working-copy
 and the actual working copy. Within the diff, undo the line-ending
 conversion on the diff for those parts that represent the
 text-base. Within the diff, verify that all line-endings for those
 parts that represent the working-copy are consistent with the
 .svn/entries property.  This diff is now suitable to send to the
 repository.
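
 A line-based sketch of those steps follows, using Python's difflib
 in place of vdelta (so it says nothing about whether vdelta itself
 can work this way). commit_diff and the (start, removed, added) hunk
 shape are assumptions made for the example.

```python
import difflib
import re

EOL_BYTES = {"LF": b"\n", "CRLF": b"\r\n", "CR": b"\r"}
_LINE = re.compile(rb"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def split_lines(data):
    """Split into lines, keeping whichever terminator each line carries."""
    return _LINE.findall(data)


def body_and_eol(line):
    """Separate a line's text from its terminator (possibly empty)."""
    body = line.rstrip(b"\r\n")
    return body, line[len(body):]


def commit_diff(text_base, working, eol_name):
    """Return hunks (start, removed text-base lines, added working lines)."""
    eol = EOL_BYTES[eol_name]
    base = split_lines(text_base)
    # Convert the text-base to get the pristine working copy.
    pristine = [b + (eol if e else b"") for b, e in map(body_and_eol, base)]
    work = split_lines(working)
    hunks = []
    matcher = difflib.SequenceMatcher(None, pristine, work, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # New lines must agree with the recorded line-end property.
        for line in work[j1:j2]:
            if body_and_eol(line)[1] not in (eol, b""):
                raise ValueError("line-ending incompatible with " + eol_name)
        # Undo the conversion: removed lines are quoted from the text-base.
        hunks.append((i1, base[i1:i2], work[j1:j2]))
    return hunks
```

 Since conversion never changes the number of lines, pristine line i
 always corresponds to text-base line i, which is what makes the
 "undo" step a simple index lookup here.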

Advantages
==========

 - On the wire and in the repository, diffs are small.
 - The working copy file gets committed exactly and does not change.[1]
 - Any working-copy file that gets committed can always be retrieved
   exactly.
 - If an erroneously converted working-copy gets committed, the
   corruption does not in general get back into the repository.

[1] Any automatic conversion system has to allow the conversion
enabling property to be unset. When this property change is committed,
the working copy needs to be changed to match the repository. This
applies whatever scheme we use. Perhaps it should occur when the user
does the propset rather than waiting until the commit?

Disadvantages
=============

 - More complicated diff algorithm; I'm not even sure the vdelta
   algorithm can be made to operate this way.
 - Something I haven't thought of...

Examples
========

 Scenario 1: text file with native-line-end property
 ----------

 check-out:   text-base    CRLF working-copy
               abc\n        abc\r\n
               def\n        def\r\n
               ghi\n        ghi\r\n

 edit:        text-base    CRLF working-copy
               abc\n        abc\r\n
               def\n        XXX\r\n
               ghi\n        ghi\r\n

 diff:                     CRLF working-copy
                           -def\n
                           +XXX\r\n

 commit:      text-base    CRLF working-copy
               abc\n        abc\r\n
               XXX\r\n      XXX\r\n
               ghi\n        ghi\r\n

 Note that the working-copy does not need to change at commit, and
 remains what would appear if the user checked-out on this platform.

 check-out:   text-base    LF working-copy
               abc\n        abc\n
               XXX\r\n      XXX\n
               ghi\n        ghi\n

 edit:        text-base    LF working-copy
               abc\n        abc\n
               XXX\r\n      YYY\n
               ghi\n        ghi\n

 diff:                     LF working-copy
                           -XXX\r\n
                           +YYY\n

 commit:      text-base    LF working-copy
               abc\n        abc\n
               YYY\n        YYY\n
               ghi\n        ghi\n

 Note that once again the working-copy does not need to change at
 commit.

 Scenario 2: binary file with erroneous native-line-end property
 ----------

 add:         text-base    LF working-copy

 The .svn/entries line-end indicates LF, the platform's native line-ending.

 edit:        text-base    LF working-copy
                            some\n
                            binary\r\n
                            data

 diff:                     LF working-copy
                           +some\n
                           +binary\r\n
                           +data

 Note that the diff contains line-end changes that are incompatible
 with the native-line-end property. This might trigger the error, or
 it may be delayed until the commit. The commit fails unless the user
 removes the native-line-end property.

 commit:      text-base    LF working-copy
               some\n       some\n
               binary\r\n   binary\r\n
               data         data

 Note that this can only be committed without line-end conversion.

 Scenario 3: binary file with erroneous native-line-end property
 ----------

 add:         text-base    LF working-copy

 edit:        text-base    LF working-copy
                            more\n
                            binary\n
                            stuff

 commit:      text-base    LF working-copy
               more\n       more\n
               binary\n     binary\n
               stuff        stuff

 Here the binary does not have a conflicting line-ending, so the
 commit succeeds.

 check-out:   text-base    CRLF working-copy
               more\n       more\r\n
               binary\n     binary\r\n
               stuff        stuff

 Here the working-copy is corrupt. If the user recognises this the
 native-line-end property can be changed and committed. This, as in any
 other scheme, has to update the working-copy. Then the user has the
 correct binary file. If the user does not have commit access, they
 can use the -no-convert option to get a valid working-copy.

 check-out:   text-base    CRLF working-copy
 -no-convert   more\n       more\n
               binary\n     binary\n
               stuff        stuff

 If the corruption is unnoticed, and the user continues, the amount of
 corruption in the repository is "stable", i.e. the working copy
 corruption will not get propagated into the repository. As follows:

 check-out:   text-base    CRLF working-copy
               more\n       more\r\n
               binary\n     binary\r\n
               stuff        stuff

 Note that the working-copy is corrupt.

 edit:        text-base    CRLF working-copy
               more\n       more\r\n
               binary\n     binary\r\n
               stuff        stuffadded

 diff:                     CRLF working-copy
                           -stuff
                           +stuffadded

 commit:      text-base    CRLF working-copy
               more\n       more\r\n
               binary\n     binary\r\n
               stuffadded   stuffadded

 Of course the resulting binary may be useless, but any scheme that
 does automatic line-end conversion can produce temporary corruption,
 and if this is not noticed problems will inevitably occur.
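
 To see why the corruption stays out of the repository, here is a
 sketch of the receiving side. It assumes the commit arrives as hunks
 of the form (start line, removed text-base lines, added working-copy
 lines); that shape and the name apply_hunks are invented for this
 example.

```python
import re

_LINE = re.compile(rb"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")


def apply_hunks(text_base, hunks):
    """Apply (start, removed, added) hunks; start indexes old text-base lines."""
    lines = _LINE.findall(text_base)
    out, pos = [], 0
    for start, removed, added in sorted(hunks, key=lambda h: h[0]):
        if lines[start:start + len(removed)] != list(removed):
            raise ValueError("hunk does not match text-base")
        out.extend(lines[pos:start])  # untouched lines keep repository bytes
        out.extend(added)             # new lines arrive as the user wrote them
        pos = start + len(removed)
    out.extend(lines[pos:])
    return b"".join(out)


# Scenario 3: only the edited last line travels; the LF lines are untouched.
new_base = apply_hunks(b"more\nbinary\nstuff",
                       [(2, [b"stuff"], [b"stuffadded"])])
```

 The CRLFs that the check-out put into the working copy never show up
 in a hunk, so new_base still has plain LF on the first two lines.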

Hmm, 3:45am, time for bed said Zebedee

-- 
Philip
