Posted to dev@subversion.apache.org by Philip Martin <ph...@codematters.co.uk> on 2002/08/28 20:12:46 UTC

[RFC] Canonical Paths

Blair Zajac <bl...@orcaware.com> writes:

> > > I can't cleanup the directory, nor revert win-tests.py.
> > >
> > > Any ideas?
> > 
> > Does avoiding the troublesome "." path work? Try
> > 
> > $ cd ..
> > $ svn cleanup svn
> 
> Thanks, that did the trick.
> 
> We gotta nail that bug soon.

Hmmm, I think I may do this, I am running into more and more problems
as I work on issue 749 and introduce the access baton.

I believe the situation is as follows:

1. The Subversion libraries only handle canonical paths, and all the
   functions should be passed canonical paths. The only exceptions to
   this rule are svn_path_internal_style, svn_path_canonicalize and
   svn_path_canonicalize_nts, which obviously can accept non-canonical
   paths.

2. The application is responsible for canonicalizing all user input
   paths before passing them to Subversion.

3. The canonical form for the path representing the current directory
   could be either "." or "".  Currently the path library is not
   consistent.  We should pick one, and then we do not need to handle
   the other.  APR is known to have some problems using "", so "."
   would be the one I would choose.

4. Functions like svn_path_split, svn_path_remove_component, etc
   should only return canonical paths.

5. In some places the path library handles NULL paths.  This is
   unnecessary, as all inputs should be canonical, and so should be
   removed (fixing any breakage that results).  
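
As a rough illustration of points 2, 3 and 5, here is a minimal sketch
of what an application-side canonicalizer could look like.  It is not
svn_path_canonicalize: the name canonical_sketch and its buffer-based
signature are made up for this example, it assumes '/' is already the
separator, and it does not try to resolve "." or ".." components.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Subversion API: writes a
   canonical form of IN to OUT (capacity OUTSZ, including the NUL).
   Returns 0 on success, -1 if OUT is too small. */
static int
canonical_sketch (const char *in, char *out, size_t outsz)
{
  size_t n = 0;
  const char *p;

  /* Point 5: NULL is not a path, so it is a caller bug, not an input. */
  assert (in != NULL && out != NULL);

  for (p = in; *p != '\0'; ++p)
    {
      /* Collapse runs of separators into a single '/'. */
      if (*p == '/' && n > 0 && out[n - 1] == '/')
        continue;
      if (n + 1 >= outsz)
        return -1;
      out[n++] = *p;
    }

  /* Drop a trailing separator, but keep the root directory "/". */
  if (n > 1 && out[n - 1] == '/')
    --n;

  /* Point 3: pick "." as the single spelling of the current directory. */
  if (n == 0)
    {
      if (outsz < 2)
        return -1;
      out[n++] = '.';
    }

  out[n] = '\0';
  return 0;
}
```

With this sketch, "foo//bar/" comes out as "foo/bar" and the empty
string comes out as ".", so library code behind it would only ever see
one spelling of each path.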

Have I missed anything?  Does anyone dispute any of the above?

One question: svn_path_internal_style, which converts from native
separators to canonical separators, doesn't seem to be used.  Should
the application pass all user input paths through this function?

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Scott Lamb <sl...@slamb.org> writes:

> I don't think perfection is possible - there are always going to be
> some paths you can't represent. In particular, Unix does not enforce
> any character set on filenames - they just can't contain null or
> '/'. So if you run Subversion in the UTF-8 locale, I think you will be
> unable to access an accented filename created by a program operating
> in an iso-8859-1 locale, since it would be an invalid UTF-8
> sequence. (And conversely, in the iso-8859-1 charset, it couldn't
> access some UTF-8 files.) I don't think that problem can be solved,
> short of redesigning the Unix filesystem model to be charset-aware.

As I have already explained, making sure the repository contains files
whose names can be represented on the systems relevant to that project
should be left to project policy.  If that particular repository is
never going to be checked out on a UNIX system with an iso-8859-1
locale, then there is no reason to impose that particular restriction
on that particular repository.  Since svn can't know about the goals
and policies of the project, it should (ideally) leave the set of
filenames totally unconstrained.  The constraints will instead come
from the OS where the files are created (since if you can't create the
file, you can't add it to the repository), and from project policy
(either informally, or enforced through commit hooks).


  // Marcus




Re: [RFC] Canonical Paths

Posted by Scott Lamb <sl...@slamb.org>.
Marcus Comstedt wrote:
> David Waite <ma...@akuma.org> writes:
>>How many files in your CVS are named '.'? ;-)
> 
> That's beside the point.  I don't like the principle of disallowing
> some pathnames, regardless of how silly or seldom used they are.

I don't think perfection is possible - there are always going to be some 
paths you can't represent. In particular, Unix does not enforce any 
character set on filenames - they just can't contain null or '/'. So if 
you run Subversion in the UTF-8 locale, I think you will be unable to 
access an accented filename created by a program operating in an
iso-8859-1 locale, since it would be an invalid UTF-8 sequence. (And
conversely, in the iso-8859-1 charset, it couldn't access some UTF-8 
files.) I don't think that problem can be solved, short of redesigning 
the Unix filesystem model to be charset-aware.
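
To make the failure concrete, here is a minimal sketch of a strict
UTF-8 check.  It assumes the current rules (sequences of at most four
octets, no overlong forms, no surrogates); the name utf8_is_valid is
made up for illustration and is not a Subversion or APR function.  An
iso-8859-1 accented letter such as 0xe9 looks like the first octet of a
three-octet sequence, so a filename that ends with it, or follows it
with plain ASCII, fails the check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the LEN octets at S form valid
   UTF-8, 0 otherwise. */
static int
utf8_is_valid (const unsigned char *s, size_t len)
{
  size_t i = 0;

  assert (s != NULL || len == 0);

  while (i < len)
    {
      unsigned char b = s[i];
      size_t need, k;
      unsigned long cp, lowest;

      if (b < 0x80)
        {
          ++i;                       /* plain ASCII octet */
          continue;
        }
      else if ((b & 0xe0) == 0xc0)
        { need = 1; cp = b & 0x1f; lowest = 0x80; }
      else if ((b & 0xf0) == 0xe0)
        { need = 2; cp = b & 0x0f; lowest = 0x800; }
      else if ((b & 0xf8) == 0xf0)
        { need = 3; cp = b & 0x07; lowest = 0x10000; }
      else
        return 0;                    /* stray continuation, or 0xf8..0xff */

      if (len - i - 1 < need)
        return 0;                    /* sequence is cut short */

      for (k = 1; k <= need; ++k)
        {
          if ((s[i + k] & 0xc0) != 0x80)
            return 0;                /* not a 10xxxxxx continuation */
          cp = (cp << 6) | (s[i + k] & 0x3f);
        }

      /* Reject overlong forms, out-of-range values and surrogates. */
      if (cp < lowest || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return 0;

      i += need + 1;
    }
  return 1;
}
```

Feeding it the octets of "caf" followed by 0xe9 (iso-8859-1) is
rejected, while "caf" followed by 0xc3 0xa9 (the UTF-8 form of the same
letter) is accepted.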

-- 
Scott Lamb



Re: [RFC] Canonical Paths

Posted by Justin Erenkrantz <je...@apache.org>.
On Thu, Aug 29, 2002 at 12:19:16AM +0100, Philip Martin wrote:
> I'm not a HTTP/WebDAV expert, but it would not surprise me if
> filenames containing '/' characters are not valid.  Even if there is a
> valid encoding don't expect the working copy to work on Unix.

Just use URL-encoding.  -- justin


Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Greg Stein <gs...@lyra.org> writes:

> Right. Changing SVN_PATH_SEPARATOR would not work. In fact, I'm surprised
> that the symbol still exists. *THE* path separator is '/'. It is the
> canonical/internal separator for the code, it is used for URLs, and it is
> used in the repository. We really can't change that.

Well the symbol I was thinking of is a #define that is local to
libsvn_subr/path.c.  I haven't tried changing it, and I suppose if we
rely on using bits of paths as URLs then changing it won't work.

-- 
Philip Martin


Re: [RFC] Canonical Paths

Posted by Greg Stein <gs...@lyra.org>.
Right. Changing SVN_PATH_SEPARATOR would not work. In fact, I'm surprised
that the symbol still exists. *THE* path separator is '/'. It is the
canonical/internal separator for the code, it is used for URLs, and it is
used in the repository. We really can't change that.

Mac users won't be able to use '/' in their pathnames. Sorry, too bad.

Cheers,
-g

On Thu, Aug 29, 2002 at 01:05:56AM +0200, Marcus Comstedt wrote:
> 
> Philip Martin <ph...@codematters.co.uk> writes:
> 
> > On that platform we can change SVN_PATH_SEPARATOR to ':' and it may
> > just work.
> 
> So the canonicalization is only canonical within the scope of a
> particular client?  There are no interoperability problems with
> different clients using different values for SVN_PATH_SEPARATOR?
> In that case I'll breathe a little easier.
> 
> But what about the repository URL for a file whose name contains a
> "/"?  If my repository is rooted at http://www.example.com/repos/foo
> and I import "bar:gaz/onk" (that is, the file "gaz/onk" in the
> directory "bar") into it with a client using ':' for
> SVN_PATH_SEPARATOR, what URL will the imported file get?
> 
> 
>   // Marcus
> 
> 
> 

-- 
Greg Stein, http://www.lyra.org/


Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Branko Čibej <br...@xbc.nu> writes:

> >Hm, now that I read your message a second time I see that you propose
> >to change the _canonical path separator_ to 0xfe.  Why would you want
> >to do that?  Leaving the canonical path separator as '/' (as in my
> >proposal) avoids changing any code that deals with path separator at
> >all.
> >
> 
> 
> Any code that deals with the path separator and doesn't use
> SVN_PATH_SEPARATOR is broken anyway. And all that code is in exactly
> one file, libsvn_subr/path.c.
> 
> 
> And anyway, I didn't say I want to do that, just that I think it's
> better than your plan to encode "dangerous" chars with invalid UTF-8
> sequences.

But you didn't say _why_ you thought it better.  You would still have
invalid UTF-8 sequences (only now you'd have them all the time), the
only difference would be that you have to do an extra conversion when
creating URLs since SVN_PATH_SEPARATOR would no longer be the URL path
separator.


  // Marcus




Re: [RFC] Canonical Paths

Posted by Branko Čibej <br...@xbc.nu>.
Marcus Comstedt wrote:

>Branko Čibej <br...@xbc.nu> writes:
>
>  
>
>>Now, if you want to do tricks like that: there are two single bytes
>>that are invalid in UTF-8: these are 0xfe (11111110) and 0xff
>>(11111111), and they also happen to cooperate quite happily with the C
>>string functions. We could use one of those as the canonical path
>>separator.
>>    
>>
>
>
>Hm, now that I read your message a second time I see that you propose
>to change the _canonical path separator_ to 0xfe.  Why would you want
>to do that?  Leaving the canonical path separator as '/' (as in my
>proposal) avoids changing any code that deals with path separator at
>all.
>  
>

Any code that deals with the path separator and doesn't use 
SVN_PATH_SEPARATOR is broken anyway. And all that code is in exactly one 
file, libsvn_subr/path.c.

And anyway, I didn't say I want to do that, just that I think it's 
better than your plan to encode "dangerous" chars with invalid UTF-8 
sequences.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Branko Čibej <br...@xbc.nu> writes:

> Now, if you want to do tricks like that: there are two single bytes
> that are invalid in UTF-8: these are 0xfe (11111110) and 0xff
> (11111111), and they also happen to cooperate quite happily with the C
> string functions. We could use one of those as the canonical path
> separator.


Hm, now that I read your message a second time I see that you propose
to change the _canonical path separator_ to 0xfe.  Why would you want
to do that?  Leaving the canonical path separator as '/' (as in my
proposal) avoids changing any code that deals with path separator at
all.


  // Marcus




Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Branko Čibej <br...@xbc.nu> writes:

> I disagree strongly. First, this "denormalized" representation is not
> valid UTF-8. And second, looking for a two-byte sequence is a pain.

But that's just the thing.  You don't have to look for it.  You still
look for the single octet 0x2f as the directory separator.  The
escaping of non-path-separating slashes into a two-byte sequence is
precisely so that you _don't_ find them when looking for path
separators.  See?
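
A small sketch of that point, assuming the 0xc0 0xaf escape from my
proposal.  The helper name last_component is made up for this example;
it simply scans for the single octet 0x2f the way existing path code
does, and the escaped slash passes through untouched because neither of
its two octets is 0x2f.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the last component of an
   internal-style path, found by scanning for the single octet 0x2f. */
static const char *
last_component (const char *path)
{
  const char *sep;

  assert (path != NULL);

  /* Only real separators match; the octets 0xc0 and 0xaf of an
     escaped '/' are both different from 0x2f. */
  sep = strrchr (path, 0x2f);
  return sep ? sep + 1 : path;
}
```

For the internal path "bar/gaz\xc0\xafonk" this returns the eight-octet
component "gaz\xc0\xafonk", that is, the file "gaz/onk" inside "bar".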

And that it's not "valid UTF-8" per se is also a plus, it means that
we won't accidentally get that sequence when encoding something else.
It's only used internally anyway.  And as I said, Java does the same
thing.


> Now, if you want to do tricks like that: there are two single bytes
> that are invalid in UTF-8: these are 0xfe (11111110) and 0xff
> (11111111), and they also happen to cooperate quite happily with the C
> string functions. We could use one of those as the canonical path
> separator.

Yes, but I don't see why it would be better.  In my opinion, a
sequence that actually decodes as '/' would be the natural choice.
Are we concerned about saving one byte of memory somewhere?


> I hope you do realize, of course, that you can't have '/'s in paths
> anyway, because we still have to be able to generate valid URLs, and
> you can't replace the path separator there.

"foo%2fbar" is a valid (relative) URL for a path with '/' in it.
It can be observed that the ftp scheme is (or at least was, I haven't
checked in the latest RFCs) specified such that a request for

ftp://ftp.example.com/foo%2fbar/hi%2fho/away%2fwe_go;type=i

should have the operational semantics of

1) connect to ftp.example.com (and log in anonymously)
2) cd to "foo/bar"
3) cd to "hi/ho"
4) get "away/we_go" (in binary mode)

which is pretty much isomorphic to what we'd want for filenames with
'/' in them ("cd to" would correspond to "follow the tree one level
downwards along an arbitrarily named edge"; these operations would be
carried out by the server rather than the client).
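
The order of operations is what makes this work: split on literal '/'
first, and only then undo the %xx escapes inside each segment.  Here is
a minimal sketch under that assumption.  The names split_then_decode
and SEG_MAX are made up for this example, the ";type=i" suffix is left
out, and a decoded %00 would end the C string early, so NUL is not
really handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_MAX 32   /* illustrative limit, not from the thread */

static int
hex_value (char ch)
{
  if (ch >= '0' && ch <= '9') return ch - '0';
  if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
  if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
  return -1;
}

/* Hypothetical helper: splits URLPATH on literal '/' first, then
   percent-decodes each segment into SEGS.  Returns the number of
   segments, or -1 on malformed input or overflow. */
static int
split_then_decode (const char *urlpath, char segs[][SEG_MAX],
                   size_t max_segs)
{
  size_t nseg = 0, n = 0;
  const char *p;

  assert (urlpath != NULL && segs != NULL);
  if (max_segs == 0)
    return -1;

  for (p = urlpath; ; ++p)
    {
      if (*p == '/' || *p == '\0')
        {
          segs[nseg][n] = '\0';      /* close the current segment */
          ++nseg;
          if (*p == '\0')
            return (int) nseg;
          if (nseg == max_segs)
            return -1;
          n = 0;
          continue;
        }
      if (n + 1 >= SEG_MAX)
        return -1;
      if (*p == '%')
        {
          int hi = hex_value (p[1]);
          int lo = hi < 0 ? -1 : hex_value (p[2]);
          if (lo < 0)
            return -1;               /* truncated or non-hex escape */
          segs[nseg][n++] = (char) (hi * 16 + lo);
          p += 2;
        }
      else
        segs[nseg][n++] = *p;
    }
}
```

Given "foo%2fbar/hi%2fho/away%2fwe_go" it yields the three names
"foo/bar", "hi/ho" and "away/we_go", matching the cd/cd/get steps
listed above.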


   // Marcus




Re: [RFC] Canonical Paths

Posted by Branko Čibej <br...@xbc.nu>.
Marcus Comstedt wrote:

>Here comes the trick.  Notice that this range includes the range
>[0 .. 127], the ASCII characters.  (In fact all UTF-8 multibyte
>escapes have a range which includes ASCII, since they all start at 0.)
>That is, although an ASCII character such as '/' is normally encoded
>as its ASCII representation (00101111), we could instead encode it as
>11000000 10101111, which would then be a kind of _escaped_ '/',
>distinguishable from (in fact completely unrelated to if you just look
>at single octets) a normal '/' used as path separator.  In the same
>way, we could encode the problematic NUL character as 11000000
>10000000.  In fact, this is exactly what Java does to NUL characters
>when storing them in UTF-8 strings, so there exists a precedent of
>using a scheme like this.
>

I disagree strongly. First, this "denormalized" representation is not 
valid UTF-8. And second, looking for a two-byte sequence is a pain.

Now, if you want to do tricks like that: there are two single bytes that 
are invalid in UTF-8: these are 0xfe (11111110) and 0xff (11111111), and 
they also happen to cooperate quite happily with the C string functions. 
We could use one of those as the canonical path separator.

I hope you do realize, of course, that you can't have '/'s in paths 
anyway, because we still have to be able to generate valid URLs, and you 
can't replace the path separator there.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: [RFC] Canonical Paths

Posted by Greg Hudson <gh...@MIT.EDU>.
I think I am -1 on fiddling with our UTF-8 decoder to support path
hacks.  It's a layering violation.  (And, as an aside, our UTF-8 decoder
should reject denormalized multibyte sequences if it doesn't already;
otherwise we probably have some nasty bugs.  So, using 0xc0 0xXX would
be relying on a bug in our decoder, and would thus be Not a Clever
Hack.)
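
For illustration, here is roughly what I mean by rejecting denormalized
sequences, as a sketch for the two-octet case only.  The function name
decode_two_octets is made up for this example and is not what utf.c
actually contains.

```c
/* Hypothetical helper: decodes one two-octet UTF-8 sequence.
   Returns the code point, or -1 if the sequence is malformed or
   denormalized (encodes a value that fits in a single octet). */
static long
decode_two_octets (unsigned char lead, unsigned char cont)
{
  long cp;

  if ((lead & 0xe0) != 0xc0 || (cont & 0xc0) != 0x80)
    return -1;                       /* not of the form 110xxxxx 10xxxxxx */

  cp = ((long) (lead & 0x1f) << 6) | (cont & 0x3f);
  if (cp < 0x80)
    return -1;                       /* overlong: lead was 0xc0 or 0xc1 */

  return cp;
}
```

A decoder written this way accepts 0xc3 0xa9 as U+00E9 but returns an
error for 0xc0 0xaf and 0xc0 0x80, which is exactly why the proposed
escapes could not pass through it.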

And I am -0 on adding any kind of complexity to support '/' or NUL in
path components.  MacOS 9 is on the way out, and most Windows programs
seem to support '/' as a path separator, so pretty much no one can
safely use '/' in path components (and NUL is right out).



Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Nuutti Kotivuori <na...@iki.fi> writes:

> Argh! I almost shit my pants when I saw this. I was born into UTF-8
> late, so I caught the latest specification, which mentioned this
> explicitly and forbade its use - and I bought it, hook, line and sinker
> - swearing to crucify any parser which didn't error out on sequences
> like this and to do the same to anyone and her whole family if
> something generated sequences like it.
> 
> When I think about it objectively - if it's entirely internal to
> Subversion and we control the decoders as well, then who cares, might
> as well do that. It's one alternative.

The other one would be to use 0xfe for non-path-separating slashes and
0xff for embedded NULs (or possibly 0xc0 and 0xc1, respectively), but
I think that would be even more ad-hoc.  It would have the upside that
svn_path_internal_style() and svn_path_local_style() don't have to
change the length of the string though.

Here's a proof of concept patch, using 0xfe/0xff notation, which
should take care of the client side.  (Maybe the server side as well,
I haven't checked).  There still needs to be something in utf.c that
converts the special octets to '/' and NUL on UTF-8 decoding to get
printouts right (could perhaps have been avoided if the 0xc0 0xXX
representation had been used instead), but that's no biggie.  Or should
printouts use local style?


  // Marcus


Index: subversion/include/svn_path.h
===================================================================
--- subversion/include/svn_path.h
+++ subversion/include/svn_path.h       Thu Aug 29 23:03:07 2002
@@ -44,7 +44,7 @@
 void svn_path_internal_style (svn_stringbuf_t *path);
 
 /* Convert PATH from the canonical internal style to the local style. */
-void svn_path_local_style (svn_stringbuf_t *path);
+svn_error_t *svn_path_local_style (svn_stringbuf_t *path);
 
 
 /* Join a base path (BASE) with a component (COMPONENT), allocated in POOL.
Index: subversion/libsvn_subr/path.c
===================================================================
--- subversion/libsvn_subr/path.c
+++ subversion/libsvn_subr/path.c	Thu Aug 29 23:08:14 2002
@@ -56,9 +56,23 @@
     {
       /* Convert all local-style separators to the canonical ones. */
       char *p;
-      for (p = path->data; *p != '\0'; ++p)
+      apr_size_t c;
+      for (p = path->data, c = path->len; c--; ++p)
         if (*p == SVN_PATH_LOCAL_SEPARATOR)
           *p = SVN_PATH_SEPARATOR;
+        else if (*p == SVN_PATH_SEPARATOR)
+          *(unsigned char *)p = 0xfe;
+        else if (*p == 0)
+          *(unsigned char *)p = 0xff;
+    }
+  else
+    {
+      /* Only need to handle embedded NULs here */
+      char *p;
+      apr_size_t c;
+      for (p = path->data, c = path->len; c--; ++p)
+        if (*p == 0)
+          *(unsigned char *)p = 0xff;
     }
 
   svn_path_canonicalize (path);
@@ -66,9 +80,12 @@
 }
 
 
-void
+svn_error_t *
 svn_path_local_style (svn_stringbuf_t *path)
 {
+  /* Danger Will Robinson!  Upon return, path may contain embedded NULs.
+     Make sure to check for them before using result as cstring. */
+
   svn_path_canonicalize (path);
   /* FIXME: Should also remove trailing /.'s, if the style says so. */
 
@@ -76,10 +93,35 @@
     {
       /* Convert all canonical separators to the local-style ones. */
       char *p;
-      for (p = path->data; *p != '\0'; ++p)
+      apr_size_t c;
+      for (p = path->data, c = path->len; c--; ++p)
         if (*p == SVN_PATH_SEPARATOR)
           *p = SVN_PATH_LOCAL_SEPARATOR;
+        else if(*(unsigned char *)p == 0xfe)
+          *p = SVN_PATH_SEPARATOR;
+        else if(*(unsigned char *)p == 0xff)
+          *p = 0;
+        else if(*p == SVN_PATH_LOCAL_SEPARATOR)
+          return svn_error_createf(SVN_ERR_BAD_FILENAME, 0, NULL, path->pool,
+                                   "Can't use '%c' as regular path character.",
+                                   SVN_PATH_LOCAL_SEPARATOR);
     }
+  else
+    {
+      /* Just convert embedded NULs and check for path separator as
+         regular path character. */
+      char *p;
+      apr_size_t c;
+      for (p = path->data, c = path->len; c--; ++p)
+        if(*(unsigned char *)p == 0xff)
+          *p = 0;
+        else if(*(unsigned char *)p == 0xfe)
+          return svn_error_createf(SVN_ERR_BAD_FILENAME, 0, NULL, path->pool,
+                                   "Can't use '%c' as regular path character.",
+                                   SVN_PATH_SEPARATOR);
+    }
+
+  return SVN_NO_ERROR;
 }
 
 
@@ -890,6 +932,14 @@
         svn_stringbuf_appendbytes (retstr, path + copied, 
                                    i - copied);
       
+      /* In case the offending character is in fact a placeholder for
+         '/' or NUL, we don't want to encode the placeholder, but rather
+         the escaped character itself.  So let's extract it.  */
+      if (c == 0xfe)
+        c = SVN_PATH_SEPARATOR;
+      else if(c == 0xff)
+        c = 0;
+
       /* Now, sprintf() in our escaped character, making sure our
          buffer is big enough to hold the '%' and two digits.  We cast
          the C to unsigned char here because the 'X' format character
@@ -953,6 +1003,10 @@
           digitz[1] = path[++i];
           digitz[2] = '\0';
           c = (char)(strtol (digitz, NULL, 16));
+          if (c == 0)
+            c = 0xff;
+          else if (c == SVN_PATH_SEPARATOR)
+            c = 0xfe;
         }
 
       retstr->data[retstr->len++] = c;


Re: [RFC] Canonical Paths

Posted by Nuutti Kotivuori <na...@iki.fi>.
Marcus Comstedt wrote:
> Greg Stein <gs...@lyra.org> writes:
>> As a URL, it would be something like:
>> 
>>   http://svn.example.com/repos/subdir/gaz%2fonk
>> 
>> i.e use URL escaping to avoid the '/' interpretation.
>> 
>> However, within our libraries... what to do? Beats the crap outta
>> me. We would need to use/invent an escaping mechanism. Personally,
>> I would simply say that the character is not allowed [on entry to
>> our libs], except as a path separator.
> 
> Hm, I think I may actually have a solution to this problem.
> Slightly hackish, but it should be workable.
> 
> We're using UTF-8 representation of the paths.  In UTF-8, ASCII
> characters (such as '/') are encoded as themselves, a single octet
> with the MSB cleared.  There also exist multibyte sequences of
> length 2-6, with each octet having the MSB set, thus making them
> easily distinguishable from ASCII characters.

Argh! I almost shit my pants when I saw this. I was born into UTF-8
late, so I caught the latest specification, which mentioned this
explicitly and forbade its use - and I bought it, hook, line and sinker
- swearing to crucify any parser which didn't error out on sequences
like this and to do the same to anyone and her whole family if
something generated sequences like it.

When I think about it objectively - if it's entirely internal to
Subversion and we control the decoders as well, then who cares, might
as well do that. It's one alternative.

But it still makes my neck hair think I should be a hedgehog.

-- Naked



Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Russ Allbery <rr...@stanford.edu> writes:

> I'm not sure if this matters for the purposes of your use of this, but
> quite a lot of UTF-8 software will refuse characters like this.  There was

We're talking about the internal path format of Subversion, so we only
have to consider the UTF-8 software "Subversion".  Other apps will
never see it in this format.  It is either converted to local notation
(if we're to use it locally, for example when creating a file in the
wc), or URL-escaped using the special "fix" I outlined, before being
passed to anything else.


> much discussion of this a while back and variant representations have been
> explicitly banned in the UTF-8 spec now, so data containing such sequences
> is invalid UTF-8.

Which, as I said to Brane, is a plus, since then we are guaranteed not
to get them from iconv.


> The justification was security worries about having multiple
> representations for special characters.

Here we don't have that though, since we are canonicalizing.  The
special path separator character would always be encoded as 0x2f, and
the slash character _when not used as a path separator_ (conceptually
a different character) would always be encoded as 0xc0 0xaf (or 0xfe,
if we decide that's better).


  // Marcus




Re: [RFC] Canonical Paths

Posted by Russ Allbery <rr...@stanford.edu>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> Here comes the trick.  Notice that this range includes the range [0
> .. 127], the ASCII characters.  (In fact all UTF-8 multibyte escapes
> have a range which includes ASCII, since they all start at 0.)  That is,
> although an ASCII character such as '/' is normally encoded as its ASCII
> representation (00101111), we could instead encode it as 11000000
> 10101111, which would then be a kind of _escaped_ '/',

I'm not sure if this matters for the purposes of your use of this, but
quite a lot of UTF-8 software will refuse characters like this.  There was
much discussion of this a while back and variant representations have been
explicitly banned in the UTF-8 spec now, so data containing such sequences
is invalid UTF-8.

The justification was security worries about having multiple
representations for special characters.

-- 
Russ Allbery (rra@stanford.edu)             <http://www.eyrie.org/~eagle/>


Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Greg Stein <gs...@lyra.org> writes:

> As a URL, it would be something like:
> 
>   http://svn.example.com/repos/subdir/gaz%2fonk
> 
> i.e use URL escaping to avoid the '/' interpretation.
> 
> However, within our libraries... what to do? Beats the crap outta me. We
> would need to use/invent an escaping mechanism. Personally, I would simply
> say that the character is not allowed [on entry to our libs], except as a
> path separator.

Hm, I think I may actually have a solution to this problem.  Slightly
hackish, but it should be workable.

We're using UTF-8 representation of the paths.  In UTF-8, ASCII
characters (such as '/') are encoded as themselves, a single octet
with the MSB cleared.  There also exist multibyte sequences of length
2-6, with each octet having the MSB set, thus making them easily
distinguishable from ASCII characters.

The shortest multibyte sequence (of length 2) has the following
structure:

110xxxxx 10xxxxxx

A header of N (2<=N<=6) ones followed by a zero marks the first octet
of a multibyte sequence of length N, and a header of 10 marks a
continuation octet.  The x bits then hold the actual character code.
So a two octet sequence can represent Unicode characters in the range
[0 .. 2^11-1], or [0 .. 2047].

Here comes the trick.  Notice that this range includes the range
[0 .. 127], the ASCII characters.  (In fact all UTF-8 multibyte
escapes have a range which includes ASCII, since they all start at 0.)
That is, although an ASCII character such as '/' is normally encoded
as its ASCII representation (00101111), we could instead encode it as
11000000 10101111, which would then be a kind of _escaped_ '/',
distinguishable from (in fact completely unrelated to if you just look
at single octets) a normal '/' used as path separator.  In the same
way, we could encode the problematic NUL character as 11000000
10000000.  In fact, this is exactly what Java does to NUL characters
when storing them in UTF-8 strings, so there exists a precedent of
using a scheme like this.

Now, the beauty of it all is that the bulk of the code needs no change
at all.  Code that separates paths by looking for the octet 0x2f can
continue to do so.  And we're still using normal C strings, so
sprintf:ing etc is no problem.  The things that _do_ need to be done to
make it all work are simply the following:

* In the URL escape code, when an 0xc0 octet is encountered, don't
  escape it as %c0.  Instead, take the next octet & 0x3f, and escape
  _that_ as %xx (would become %2f or %00 in the cases at hand, but any
  character in [0 .. 63] can automatically have its potential
  escaped variants handled by the same code at no extra cost).

  Note that a "real" two-byte UTF-8 sequence will never begin with
  0xc0, since the character would have code 0x80 or higher, giving the
  first octet the actual range [0xc2 .. 0xdf].


* On a system where '/' is a nonspecial filename character (such as
  MacOS Classic), canonicalization takes place as follows:

  1) UTF-8 encode the path
  2) Replace all '/' with the escaped variant
  3) Replace the path separator (':') with "real" '/'s.


* Decanonicalization has to be the reverse of canonicalization as usual of
  course, so it should do the above but backwards on Mac.
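
The steps above can be sketched as follows.  This is only an
illustration under stated assumptions: the names mac_to_internal and
internal_to_url are made up, the input is assumed to be UTF-8 already,
plain C strings stand in for svn_stringbuf_t, and the URL escaper
passes through only a tiny set of safe characters to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: converts a UTF-8 encoded MacOS Classic path
   (':' separated, '/' is an ordinary character) to the internal form.
   Returns 0 on success, -1 if OUT (capacity OUTSZ) is too small. */
static int
mac_to_internal (const char *local, char *out, size_t outsz)
{
  size_t n = 0;
  const char *p;

  assert (local != NULL && out != NULL);
  for (p = local; *p != '\0'; ++p)
    {
      if (n + 3 > outsz)             /* room for two octets plus the NUL */
        return -1;
      if (*p == '/')
        {
          out[n++] = (char) 0xc0;    /* escaped, non-separating '/' */
          out[n++] = (char) 0xaf;
        }
      else
        out[n++] = (*p == ':') ? '/' : *p;
    }
  if (n + 1 > outsz)
    return -1;
  out[n] = '\0';
  return 0;
}

/* Hypothetical helper: URL-escapes an internal path.  A 0xc0 lead
   octet is not emitted as %c0; the low six bits of the octet that
   follows it are escaped instead. */
static int
internal_to_url (const char *path, char *out, size_t outsz)
{
  const unsigned char *p = (const unsigned char *) path;
  size_t n = 0;

  assert (path != NULL && out != NULL);
  for (; *p != '\0'; ++p)
    {
      unsigned char c = *p;

      if (n + 4 > outsz)             /* room for "%xx" plus the NUL */
        return -1;
      if (c == 0xc0 && p[1] != '\0')
        {
          c = (unsigned char) (*++p & 0x3f);
          n += (size_t) sprintf (out + n, "%%%02x", (unsigned int) c);
        }
      else if (c == '/' || c == '.' || c == '_' || c == '-'
               || (c >= '0' && c <= '9')
               || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        out[n++] = (char) c;
      else
        n += (size_t) sprintf (out + n, "%%%02x", (unsigned int) c);
    }
  if (n + 1 > outsz)
    return -1;
  out[n] = '\0';
  return 0;
}
```

Running the Mac path "bar:gaz/onk" through mac_to_internal and then
internal_to_url gives "bar/gaz%2fonk", the relative URL discussed
earlier in the thread.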


And that's it.  Presto, instant '/' in filenames support (and NULs as
well, if we make a special svn_canonicalize_path() that takes Pascal
strings...)  I must say I have impressed myself with this solution at
least.  ;-)


  // Marcus




Re: [RFC] Canonical Paths

Posted by Greg Stein <gs...@lyra.org>.
On Thu, Aug 29, 2002 at 12:19:16AM +0100, Philip Martin wrote:
>...
> > But what about the repository URL for a file whose name contains a
> > "/"?  If my repository is rooted at http://www.example.com/repos/foo
> > and I import "bar:gaz/onk" (that is, the file "gaz/onk" in the
> > directory "bar") into it with a client using ':' for
> > SVN_PATH_SEPARATOR, what URL will the imported file get?
> 
> I'm not a HTTP/WebDAV expert, but it would not surprise me if
> filenames containing '/' characters are not valid.  Even if there is a
> valid encoding don't expect the working copy to work on Unix.

As a URL, it would be something like:

  http://svn.example.com/repos/subdir/gaz%2fonk

i.e use URL escaping to avoid the '/' interpretation.

However, within our libraries... what to do? Beats the crap outta me. We
would need to use/invent an escaping mechanism. Personally, I would simply
say that the character is not allowed [on entry to our libs], except as a
path separator.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> If you version a file called "." on AmigaOS don't expect the working
> copy to work on Unix.  If you version a file called ":" on Unix don't
> expect the working copy to work on a system with ":" as a directory
> separator. Etc., etc.

Well, sharing WC's between architectures wasn't really what I was
concerned about.  As long as the repository can be shared, I'm fine
with that.


> We had a file called
> 
> packages/freebsd/subversion/files/patch-build::buildcheck.sh
> 
> which was renamed because it failed to check out on Win32 systems.

As can be expected.  Just as a filename with a greek letter in it
would fail to check out on a UNIX system using a latin-1 system
locale.  It's up to project policy to decide on filenames so that they
can be checked out on the relevant systems.


> I'm not a HTTP/WebDAV expert, but it would not surprise me if
> filenames containing '/' characters are not valid.

Hm, now that I think about it, encoding "/" in filenames as "%2f" might
actually work.  The file in question would then have the URL
http://www.example.com/repos/foo/bar/gaz%2fonk.  This would be really cool.


> Even if there is a
> valid encoding don't expect the working copy to work on Unix.

As I said, as long as the repository works that's fine.  I shouldn't
be able to check out the particular file "gaz/onk" of course (there
should be a test here so I don't accidentally get something in the
directory "gaz" if such a directory exists), but I should still be
able to check out any other file whose name is valid on UNIX.


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> Philip Martin <ph...@codematters.co.uk> writes:
> 
> > On that platform we can change SVN_PATH_SEPARATOR to ':' and it may
> > just work.
> 
> So the canonicalization is only canonical within the scope of a
> particular client?  There are no interoperability problems with
> different clients using different values for SVN_PATH_SEPARATOR?
> In that case I'll breathe a little easier.

If you version a file called "." on AmigaOS don't expect the working
copy to work on Unix.  If you version a file called ":" on Unix don't
expect the working copy to work on a system with ":" as a directory
separator. Etc., etc.

We had a file called

packages/freebsd/subversion/files/patch-build::buildcheck.sh

which was renamed because it failed to check out on Win32 systems.

> But what about the repository URL for a file whose name contains a
> "/"?  If my repository is rooted at http://www.example.com/repos/foo
> and I import "bar:gaz/onk" (that is, the file "gaz/onk" in the
> directory "bar") into it with a client using ':' for
> SVN_PATH_SEPARATOR, what URL will the imported file get?

I'm not a HTTP/WebDAV expert, but it would not surprise me if
filenames containing '/' characters are not valid.  Even if there is a
valid encoding don't expect the working copy to work on Unix.

-- 
Philip Martin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> On that platform we can change SVN_PATH_SEPARATOR to ':' and it may
> just work.

So the canonicalization is only canonical within the scope of a
particular client?  There are no interoperability problems with
different clients using different values for SVN_PATH_SEPARATOR?
In that case I'll breathe a little easier.

But what about the repository URL for a file whose name contains a
"/"?  If my repository is rooted at http://www.example.com/repos/foo
and I import "bar:gaz/onk" (that is, the file "gaz/onk" in the
directory "bar") into it with a client using ':' for
SVN_PATH_SEPARATOR, what URL will the imported file get?


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> Interesting.  And as I noted elsewhere, it could also have "/" in
> filenames since the directory separator is ":", which means the
> canonicalization of "foo:bar/gazonk" to "foo/bar/gazonk" would not be
> reversible.

On that platform we can change SVN_PATH_SEPARATOR to ':' and it may
just work.

-- 
Philip Martin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Nuutti Kotivuori <na...@iki.fi> writes:

> Marcus Comstedt wrote:
> > David Waite <ma...@akuma.org> writes:
> >> How many files in your CVS are named '.'? ;-)
> > 
> > That's beside the point.  I don't like the principle of disallowing
> > some pathnames, regardless of how silly or seldom used they are.  I
> > can live with the system not being able to version control files or
> > directories called '.svn', since obviously the WC needs some name
> > for the storage of its metadata.  But adding '.' to the list of
> > forbidden names when it's easy not to seems evil.
> 
> MacOS supports having NUL-characters in filenames, since they are
> stored as pascal-strings. When does APR support those on MacOS?

Interesting.  And as I noted elsewhere, it could also have "/" in
filenames since the directory separator is ":", which means the
canonicalization of "foo:bar/gazonk" to "foo/bar/gazonk" would not be
reversible.

The ideal canonicalization format would probably be a list of
svn_string, but that might be a bit awkward to work with.
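
To make the idea concrete, here is a minimal sketch in C of a path
held as a list of counted strings.  The types component_t and
path_list_t and the function path_list_equal are hypothetical;
component_t only stands in for svn_string and is not the real
Subversion type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string, standing in for svn_string: the length
   is explicit, so a component may contain '/' or even NUL bytes. */
typedef struct component_t
{
  const char *data;
  size_t len;
} component_t;

/* Hypothetical path represented as a list of components, with no
   separator character stored anywhere. */
typedef struct path_list_t
{
  const component_t *elts;
  size_t nelts;
} path_list_t;

/* Return 1 if A and B have the same components, byte for byte,
   and 0 otherwise. */
int
path_list_equal(const path_list_t *a, const path_list_t *b)
{
  size_t i;

  assert(a != NULL && b != NULL);
  if (a->nelts != b->nelts)
    return 0;
  for (i = 0; i < a->nelts; i++)
    {
      if (a->elts[i].len != b->elts[i].len
          || memcmp(a->elts[i].data, b->elts[i].data, a->elts[i].len) != 0)
        return 0;
    }
  return 1;
}
```

With this representation "foo:bar/gazonk" is the two components
{"foo", "bar/gazonk"}, which stays distinct from the three components
{"foo", "bar", "gazonk"}, so the mapping back to a native path is not
ambiguous.  The cost is that every function taking a plain char *
path would need a new signature.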


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Nuutti Kotivuori <na...@iki.fi>.
Marcus Comstedt wrote:
> David Waite <ma...@akuma.org> writes:
>> How many files in your CVS are named '.'? ;-)
> 
> That's beside the point.  I don't like the principle of disallowing
> some pathnames, regardless of how silly or seldom used they are.  I
> can live with the system not being able to version control files or
> directories called '.svn', since obviously the WC needs some name
> for the storage of its metadata.  But adding '.' to the list of
> forbidden names when it's easy not to seems evil.

MacOS supports having NUL-characters in filenames, since they are
stored as pascal-strings. When does APR support those on MacOS?

Actually, I'm not trying to be a wiseguy - if it's simple to support a
file named "." on AmigaOS then let's by all means do that.

-- Naked


---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
David Waite <ma...@akuma.org> writes:

> How many files in your CVS are named '.'? ;-)

That's beside the point.  I don't like the principle of disallowing
some pathnames, regardless of how silly or seldom used they are.  I
can live with the system not being able to version control files or
directories called '.svn', since obviously the WC needs some name for
the storage of its metadata.  But adding '.' to the list of forbidden
names when it's easy not to seems evil.


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by David Waite <ma...@akuma.org>.
How many files in your CVS are named '.'? ;-)

-David Waite

Marcus Comstedt wrote:

>Philip Martin <ph...@codematters.co.uk> writes:
>
>  
>
>>In the meantime I don't care about this AmigaOS restriction.  Do we
>>have any AmigaOS users, or are you raising a purely hypothetical
>>problem?
>>    
>>
>
>I myself use AmigaOS on a daily basis.  And I run a cvs client on the
>Amiga, which I would have (and want) to replace with a svn client if I
>were to upgrade the CVS server on my Sparc to a svn one.  Adding
>AmigaOS support to APR should be a SMOP if APR is designed correctly
>(which it ought to be since its very purpose is to allow support for
> many OSes).
>
>
>  // Marcus
>
>
>
>
>  
>


---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> Marcus Comstedt <ma...@mc.pp.se> writes:
> 
> > What am I missing here?
> 
> Someone to work out which APR functions we need to worry about.
> Someone to go through the Subversion code and make sure that we always
> convert "" to "/" before calling those APR functions.  Feel free to
> send a patch :-) Just don't ask me to do it, I'm not interested.  Like
> I said, I *am* making it easier for someone else to do it.

I have already sent in a patch that adds UTF-8 translations at those
places.  So the places are easy to find, just M-x grep.  Most of them
reside in libsvn_subr/io.c by the way.  As I said earlier, I expect
the correct thing to do would be to make a call to a canonicalization
function wherever a path is converted _to_ UTF-8, and a call to a
decanonicalization function wherever a path is converted _from_
UTF-8.  Then the appropriate handling of path separators and cwd names
can be put into these two functions as needed.  The UTF-8 conversion
calls could also be moved into these two functions of course.

I'm not asking you to do it, I'm just discussing what a good design
should look like.


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> What am I missing here?

Someone to work out which APR functions we need to worry about.
Someone to go through the Subversion code and make sure that we always
convert "" to "/" before calling those APR functions.  Feel free to
send a patch :-) Just don't ask me to do it, I'm not interested.  Like
I said, I *am* making it easier for someone else to do it.

-- 
Philip Martin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> Evidence of a potential user, but one who appears to have not yet
> tried to compile APR, let alone Subversion.  What about neon, or db4,
> you will need at least one of those, do they work?  Looks like a
> hypothetical problem so far :-)

It's not a problem right now, I'll grant you that.  But now is the
time when we can prevent it from becoming a problem later, right?  Can
we gratuitously change the canonical format later without messing up
existing repositories and/or WCs?


> I'm not trying to make life hard for you, I'm making it better.  The
> current path library's use of ".", which is distributed over the file,
> will be factored into two macros and so be easier to change.  You may
> want to look at svn_cl__push_implicit_dot_target as well...
> 
> However, as I said in an earlier mail, we cannot use "" at present.
> If someone wants to fix APR, or wrap all Subversion's APR calls, they
> can do it.  My priority is the systems I use, which happen to run
> Subversion *now*.

I don't really see why we can't use "" now.  When we get paths from
the user, we replace the real directory separator with "/" (which
incidentally might pose a problem on MacOS Classic, if you have files
with "/" in the name), convert the path components to UTF-8, and
replace the real name of the current directory with "".  Upon feeding
the paths to system calls, we do the reverse transformation.
Currently only the UTF-8 transformations are actually carried out, but
all these steps are reversible AFAICT.  What am I missing here?
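
The round trip described above could be sketched in C roughly as
follows.  This is an illustration under stated assumptions, not
Subversion code: the type native_syntax_t and the functions
copy_mapped, to_canonical and to_native are hypothetical, the UTF-8
step is left out, and the sketch assumes no component contains either
separator character (the MacOS Classic case mentioned above is exactly
where that assumption breaks).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of the native path syntax. */
typedef struct native_syntax_t
{
  char separator;        /* native directory separator, e.g. '/' or ':' */
  const char *cwd_name;  /* native spelling of the current directory */
} native_syntax_t;

/* Copy SRC (including its NUL) to DST, mapping every FROM byte to TO.
   Returns -1 if DST (of size DSTLEN) is too small. */
int
copy_mapped(const char *src, char from, char to, char *dst, size_t dstlen)
{
  size_t i, len = strlen(src);

  if (len >= dstlen)
    return -1;
  for (i = 0; i <= len; i++)
    dst[i] = (src[i] == from) ? to : src[i];
  return 0;
}

/* Native -> canonical: the cwd becomes "", the separator becomes '/'. */
int
to_canonical(const native_syntax_t *syn, const char *native,
             char *out, size_t outlen)
{
  assert(syn != NULL && native != NULL && out != NULL && outlen > 0);
  if (strcmp(native, syn->cwd_name) == 0)
    {
      out[0] = '\0';
      return 0;
    }
  return copy_mapped(native, syn->separator, '/', out, outlen);
}

/* Canonical -> native: the reverse mapping, meant to be applied just
   before a path is handed to a system call. */
int
to_native(const native_syntax_t *syn, const char *canonical,
          char *out, size_t outlen)
{
  assert(syn != NULL && canonical != NULL && out != NULL && outlen > 0);
  if (canonical[0] == '\0')
    {
      if (strlen(syn->cwd_name) >= outlen)
        return -1;
      strcpy(out, syn->cwd_name);
      return 0;
    }
  return copy_mapped(canonical, '/', syn->separator, out, outlen);
}
```

Keeping both directions next to each other, driven by one description
of the native syntax, makes it easier to check that each step really
is undone by its counterpart.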


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Branko Čibej <br...@xbc.nu> writes:

> Marcus Comstedt wrote:
> 
> >On a system where '.' denotes the current working directory, then
> >
> [...]
> 
> I'm beginning to think that maybe we _should_ forbid certain
> characters in filenames. Why? because not doing so will create an
> interoperability nightmare for users. My candidates would be:
> 
> 
>     * '/', '\', ':' -- the most common directory separators (never mind
>       the strangeness of VMS paths); ':' also saves us grief on Windows.
>     * '.', '..' and '.svn': these names should be forbidden, for obvious
>       reasons. Actually, I'm not sure about '.svn'; that could be
>       configurable in the long term.
>     * Maybe other chars. Windows forbids '\', '/', ':', '*', '?', '<',
>       '>' and '|'.
> 
> I know this may seem restrictive, but we should at least have an
> option to _warn_ about such characters, to save people from further
> grief.

I think not being able to add files containing e.g. ':' on UNIX would
annoy users more.  Having an interoperability warning (that can be
disabled with .svnconfig) sounds potentially useful though.  I'm +0 on
that.


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Hee hee.

This is fun.  I hope, though, that Philip is now ignoring this
discussion and just doing Whatever He Thinks Is Right in the code.

:-)

-K


Alan Shutko <at...@acm.org> writes:
> Branko Čibej <br...@xbc.nu> writes:
> 
> > I'm beginning to think that maybe we _should_ forbid certain
> > characters in filenames. Why? because not doing so will create an
> > interoperability nightmare for users. My candidates would be:
> 
> I wonder if (some undetermined time in the future) it would be
> desirable to support file renaming on checkout.  If you can't
> represent the filename in your current locale and OS, it's renamed on
> checkout by mapping the characters to something that can be
> represented.
> 
> This wouldn't really make sense for code, because your build scripts
> would be looking for the wrong names (unless you coded them in), but
> it could make sense for other documents.
> 
> I'm not sure that this is a good idea, and there would be fiddly cases
> to worry about (like making sure the renamed file didn't conflict
> with a real file), but it might be worth thinking about.
> 
> -- 
> Alan Shutko <at...@acm.org> - In a variety of flavors!
> That is not what room service means on Earth.
> 

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Greg Hudson <gh...@MIT.EDU>.
On Thu, 2002-08-29 at 14:40, Alan Shutko wrote:
> I wonder if (some undetermined time in the future) it would be
> desirable to support file renaming on checkout.

File renaming is what MacOS X does to get compatibility between FFS and
the Unix API on one side (with '/' as a path separator and ':' allowed
in path components) and HFS and the MacOS 9 API on the other side (with
the reverse).

I think it makes sense when you have exactly two conventions to support,
but it doesn't scale when you have three or more.  (And even with two
conventions, Apple acknowledges that you can get user-visible
schizophrenia about what the name of a file is; you have to type
foo/bar:baz into a BSD app but foo:bar/baz into a MacOS 9 app.  For the
most part this isn't a problem because there isn't really such a thing
as a "BSD app", but it's still worrisome.)


---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Alan Shutko <at...@acm.org>.
Branko Čibej <br...@xbc.nu> writes:

> I'm beginning to think that maybe we _should_ forbid certain
> characters in filenames. Why? because not doing so will create an
> interoperability nightmare for users. My candidates would be:

I wonder if (some undetermined time in the future) it would be
desirable to support file renaming on checkout.  If you can't
represent the filename in your current locale and OS, it's renamed on
checkout by mapping the characters to something that can be
represented.

This wouldn't really make sense for code, because your build scripts
would be looking for the wrong names (unless you coded them in), but
it could make sense for other documents.

I'm not sure that this is a good idea, and there would be fiddly cases
to worry about (like making sure the renamed file didn't conflict
with a real file), but it might be worth thinking about.

-- 
Alan Shutko <at...@acm.org> - In a variety of flavors!
That is not what room service means on Earth.

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Branko Čibej <br...@xbc.nu>.
Marcus Comstedt wrote:

>On a system where '.' denotes the current working directory, then
>
[...]

I'm beginning to think that maybe we _should_ forbid certain characters 
in filenames. Why? because not doing so will create an interoperability 
nightmare for users. My candidates would be:

    * '/', '\', ':' -- the most common directory separators (never mind
      the strangeness of VMS paths); ':' also saves us grief on Windows.
    * '.', '..' and '.svn': these names should be forbidden, for obvious
      reasons. Actually, I'm not sure about '.svn'; that could be
      configurable in the long term.
    * Maybe other chars. Windows forbids '\', '/', ':', '*', '?', '<',
      '>' and '|'.

I know this may seem restrictive, but we should at least have an option 
to _warn_ about such characters, to save people from further grief.
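
A warning check along these lines might look like the following
minimal sketch in C.  The function name name_is_portable is
hypothetical, and the set of names and characters is simply the one
listed above, not an agreed policy.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if NAME (a single path component) avoids the names and
   characters listed above, or 0 if a client might want to warn. */
int
name_is_portable(const char *name)
{
  assert(name != NULL);

  /* Names with a special meaning on some platform or to the WC. */
  if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0
      || strcmp(name, ".svn") == 0)
    return 0;

  /* Common directory separators plus the other characters that
     Windows forbids. */
  return strpbrk(name, "\\/:*?<>|") == NULL;
}
```

Because the function only reports a verdict, the caller decides
whether to warn, refuse, or stay silent, which fits the "option to
warn" suggested here.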


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
cmpilato@collab.net writes:

> Also, I'm wondering something.  If '' is the canonical empty dir, then
> would:
> 
>    svn_path_canonicalize ('./foo')
> 
> return 'foo' (which feels, for unexplained reasons, like an
> information loss) or '/foo' (which is plain wrong) ?

On a system where '.' denotes the current working directory, then
svn_path_canonicalize ('./foo') should definitely return 'foo'.  These
names denote the same object by definition, so there is no information
loss.  And since they are by definition the same they _should_
canonicalize to the same name; this name being the "canonical" name
for the object.  Hence "canonicalization".

On a system where '.' does _not_ mean the current working directory
(but where '/' is still the directory separator), you should get back
'./foo' because foo would then reside in the distinct directory '.'
which is different from the cwd.
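
The two cases can be sketched in C as follows.  This is only an
illustration of the behaviour described above, not the real
svn_path_canonicalize; the function canonicalize_in_place and its
dot_is_cwd flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: rewrite PATH in place, dropping repeated and
   trailing '/' characters.  When DOT_IS_CWD is nonzero, "." components
   are dropped as well, so "./foo" becomes "foo" and "." becomes "".
   When it is zero, "." is kept as an ordinary directory name. */
void
canonicalize_in_place(char *path, int dot_is_cwd)
{
  char *src = path;
  char *dst = path;
  int absolute;

  assert(path != NULL);
  absolute = (path[0] == '/');
  if (absolute)
    {
      /* Keep the leading '/' of an absolute path. */
      src++;
      dst++;
    }

  while (*src != '\0')
    {
      size_t len;

      while (*src == '/')
        src++;
      if (*src == '\0')
        break;

      len = strcspn(src, "/");
      if (dot_is_cwd && len == 1 && src[0] == '.')
        {
          src += len;
          continue;
        }

      /* Separate this component from the previous one, if any. */
      if (dst != path + absolute)
        *dst++ = '/';
      memmove(dst, src, len);
      dst += len;
      src += len;
    }
  *dst = '\0';
}
```

The output is never longer than the input, which is why the rewrite
can be done in place without a second buffer.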


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by cm...@collab.net.
cmpilato@collab.net writes:

> Also, I'm wondering something.  If '' is the canonical empty dir, then
> would:
> 
>    svn_path_canonicalize ('./foo')
> 
> return 'foo' (which feels, for unexplained reasons, like an
> information loss) or '/foo' (which is plain wrong) ?

Scratch that.  Peace cometh in the quiet moment.  I'm fine with a
'foo' return.

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by cm...@collab.net.
Philip Martin <ph...@codematters.co.uk> writes:

> Strange how things turn out.  I have made the changes to the path
> library, the path regression tests now pass using either "." or "" as
> the empty path (and switching is easy).  However when we come to the
> full regression test "." fails abysmally, whereas "" works.  I've
> looked at a few of the "." failures and it appears that the FS library
> also uses the path library, and it doesn't handle "." as an empty dir.
> 
> So, despite my earlier emails, it looks like I'm going to make "" the
> canonical empty path.  Hey, we all knew that was really the right
> solution, didn't we ;-)

No libsvn_fs path is allowed to have a '.' component, but that
doesn't mean the filesystem, as a user of the path utility functions,
can't just learn to expect a '.' return value from, say,
svn_path_split() to mean "the path passed in had but one component".
You just have to teach the filesystem code the new semantics, and let
it modify its behavior accordingly.
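
To illustrate what "learning the new semantics" could mean for a
caller, here is a minimal sketch in C.  It is not the real
svn_path_split; the function split_path and its empty_dir parameter
are hypothetical, and the input is assumed to be canonical already
(no trailing or repeated '/').

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: split canonical PATH into a directory part,
   copied into DIR (of size DIRLEN), and a base name, returned through
   BASE as a pointer into PATH.  When PATH has only one component, DIR
   is set to EMPTY_DIR, the caller's spelling of the empty path ("" or
   ".").  Returns 0 on success, -1 if DIR is too small. */
int
split_path(const char *path, const char *empty_dir,
           char *dir, size_t dirlen, const char **base)
{
  const char *slash;
  size_t n;

  assert(path != NULL && empty_dir != NULL && dir != NULL && base != NULL);
  slash = strrchr(path, '/');
  if (slash == NULL)
    {
      /* Only one component: report the empty directory. */
      if (strlen(empty_dir) >= dirlen)
        return -1;
      strcpy(dir, empty_dir);
      *base = path;
      return 0;
    }

  /* Keep the lone "/" when the parent is the root directory. */
  n = (slash == path) ? 1 : (size_t) (slash - path);
  if (n >= dirlen)
    return -1;
  memcpy(dir, path, n);
  dir[n] = '\0';
  *base = slash + 1;
  return 0;
}
```

A caller that wants to know whether the path "had but one component"
compares the returned directory with the same empty_dir value it
passed in, rather than hard-coding "." or "".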

Also, I'm wondering something.  If '' is the canonical empty dir, then
would:

   svn_path_canonicalize ('./foo')

return 'foo' (which feels, for unexplained reasons, like an
information loss) or '/foo' (which is plain wrong) ?

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Greg Stein <gs...@lyra.org> writes:

> On Wed, Aug 28, 2002 at 11:13:58PM +0100, Philip Martin wrote:
> >...
> > However, as I said in an earlier mail, we cannot use "" at present.
> > If someone wants to fix APR, or wrap all Subversion's APR calls, they
> > can do it.  My priority is the systems I use, which happen to run
> > Subversion *now*.
> 
> As a meta-issue, I just wanted to note that I support this point of view.
> Developers are contributing their time. As a result, they get to choose how
> and where their time is applied. And in a case like this, it is the most
> feasible situation: code for a platform that you use/know, rather than
> guessing at what "is out there."

Strange how things turn out.  I have made the changes to the path
library, the path regression tests now pass using either "." or "" as
the empty path (and switching is easy).  However when we come to the
full regression test "." fails abysmally, whereas "" works.  I've
looked at a few of the "." failures and it appears that the FS library
also uses the path library, and it doesn't handle "." as an empty dir.

So, despite my earlier emails, it looks like I'm going to make "" the
canonical empty path.  Hey, we all knew that was really the right
solution, didn't we ;-)

-- 
Philip Martin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Greg Stein <gs...@lyra.org>.
On Wed, Aug 28, 2002 at 11:13:58PM +0100, Philip Martin wrote:
>...
> However, as I said in an earlier mail, we cannot use "" at present.
> If someone wants to fix APR, or wrap all Subversion's APR calls, they
> can do it.  My priority is the systems I use, which happen to run
> Subversion *now*.

As a meta-issue, I just wanted to note that I support this point of view.
Developers are contributing their time. As a result, they get to choose how
and where their time is applied. And in a case like this, it is the most
feasible situation: code for a platform that you use/know, rather than
guessing at what "is out there."

[ of course, if they apply their time to coding half-completed crap, cuz
  "new" is more fun than "completing" then we just revoke their commit
  access and be done with it :-) ]


You go, Philip! :-)

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> Philip Martin <ph...@codematters.co.uk> writes:
> 
> > In the meantime I don't care about this AmigaOS restriction.  Do we
> > have any AmigaOS users, or are you raising a purely hypothetical
> > problem?
> 
> I myself use AmigaOS on a daily basis.  And I run a cvs client on the
> Amiga, which I would have (and want) to replace with a svn client if I
> were to upgrade the CVS server on my Sparc to a svn one.  Adding
> AmigaOS support to APR should be a SMOP if APR is designed correctly
> (which it ought to be since its very purpose is to allow support for
>  many OSes).

Evidence of a potential user, but one who appears to have not yet
tried to compile APR, let alone Subversion.  What about neon, or db4,
you will need at least one of those, do they work?  Looks like a
hypothetical problem so far :-)

I'm not trying to make life hard for you, I'm making it better.  The
current path library's use of ".", which is distributed over the file,
will be factored into two macros and so be easier to change.  You may
want to look at svn_cl__push_implicit_dot_target as well...
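
As a rough sketch of that factoring (in C, with hypothetical names
EMPTY_PATH, PATH_IS_EMPTY and join_paths that are assumptions for
illustration, not the identifiers used in the path library), the
spelling of the empty path would appear in exactly two macros and
nowhere else:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names.  These two macros are the only place where the
   spelling of the empty path appears; changing "." to "" here is
   meant to switch the convention for all code that uses them. */
#define EMPTY_PATH "."
#define PATH_IS_EMPTY(p) (strcmp((p), EMPTY_PATH) == 0)

/* Join BASE and COMPONENT into OUT (of size OUTLEN).  Joining with the
   empty path yields the other argument unchanged.  Returns 0 on
   success, -1 if OUT is too small. */
int
join_paths(const char *base, const char *component,
           char *out, size_t outlen)
{
  const char *single = NULL;
  size_t blen, clen, sep;

  assert(base != NULL && component != NULL && out != NULL);
  if (PATH_IS_EMPTY(base))
    single = component;
  else if (PATH_IS_EMPTY(component))
    single = base;

  if (single != NULL)
    {
      if (strlen(single) >= outlen)
        return -1;
      strcpy(out, single);
      return 0;
    }

  blen = strlen(base);
  clen = strlen(component);
  /* Do not add a second '/' after a base that already ends in one,
     such as the root directory. */
  sep = (blen > 0 && base[blen - 1] == '/') ? 0 : 1;
  if (blen + sep + clen >= outlen)
    return -1;
  memcpy(out, base, blen);
  if (sep)
    out[blen] = '/';
  memcpy(out + blen + sep, component, clen + 1);
  return 0;
}
```

Code written against the two macros never mentions "." directly, so
the choice between "." and "" stays a one-line change.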

However, as I said in an earlier mail, we cannot use "" at present.
If someone wants to fix APR, or wrap all Subversion's APR calls, they
can do it.  My priority is the systems I use, which happen to run
Subversion *now*.

-- 
Philip Martin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> In the meantime I don't care about this AmigaOS restriction.  Do we
> have any AmigaOS users, or are you raising a purely hypothetical
> problem?

I myself use AmigaOS on a daily basis.  And I run a cvs client on the
Amiga, which I would have (and want) to replace with a svn client if I
were to upgrade the CVS server on my Sparc to a svn one.  Adding
AmigaOS support to APR should be a SMOP if APR is designed correctly
(which it ought to be since its very purpose is to allow support for
 many OSes).


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Marcus Comstedt <ma...@mc.pp.se> writes:

> If you ask me, OTOH, "." is the wrong answer.  If you canonicalize the
> current directory to ".", what are you going to canonicalize a
> reference to a regular file called "." to on a system that supports
> such files (like AmigaOS)?  I don't know of any system that allows ""
> as the name of a regular file.

I intend to modify the path library so that "." only occurs explicitly
in one or two places.  Then if anyone fixes APR to handle "", or someone
(an AmigaOS user?) wraps all the Subversion APR calls, changing "." to
"" should be simple.

In the meantime I don't care about this AmigaOS restriction.  Do we
have any AmigaOS users, or are you raising a purely hypothetical
problem?

-- 
Philip Martin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Karl Fogel <kf...@newton.ch.collab.net> writes:

> Marcus Comstedt <ma...@mc.pp.se> writes:
> > If you ask me, OTOH, "." is the wrong answer.  If you canonicalize the
> > current directory to ".", what are you going to canonicalize a
> > reference to a regular file called "." to on a system that supports
> > such files (like AmigaOS)?  I don't know of any system that allows ""
> > as the name of a regular file.
> 
> But does APR even work on such systems?

If not, that's a problem with APR (or with the choice to use APR,
depending on your standpoint) that has to be fixed, not a reason
to propagate such problems into the code of Subversion.  What good is
a Portable Runtime if it's not portable?

  // Marcus


---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Marcus Comstedt <ma...@mc.pp.se> writes:
> If you ask me, OTOH, "." is the wrong answer.  If you canonicalize the
> current directory to ".", what are you going to canonicalize a
> reference to a regular file called "." to on a system that supports
> such files (like AmigaOS)?  I don't know of any system that allows ""
> as the name of a regular file.

But does APR even work on such systems?

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Justin Erenkrantz <je...@apache.org> writes:

> > 3. The canonical form for the path representing the current directory
> >    could be either "." or "".  Currently the path library is not
> >    consistent.  We should pick one, and then we do not need to handle
> >    the other.  APR is known to have some problems using "", so "."
> >    would be the one I would choose.
> 
> Yeah, "." is the right answer if you ask me.

If you ask me, OTOH, "." is the wrong answer.  If you canonicalize the
current directory to ".", what are you going to canonicalize a
reference to a regular file called "." to on a system that supports
such files (like AmigaOS)?  I don't know of any system that allows ""
as the name of a regular file.


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Philip Martin <ph...@codematters.co.uk> writes:

> UTF-8 already happens in svn_cl__args_to_target_array.

So that is probably where the paths should be canonicalized as well.
Since UTF-8 is part of the "canonical path specification",
canonicalizing and native->UTF-8 conversion should always happen at
the same places, and decanonicalizing and UTF-8->native conversion
should happen at the same places.  Since there (hopefully) already are
UTF-8 conversions everywhere there needs to be, this is a simple way
of finding where to insert canonicalizations and decanonicalizations.


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Philip Martin <ph...@codematters.co.uk>.
Justin Erenkrantz <je...@apache.org> writes:

> > 5. In some places the path library handles NULL paths.  This is
> >    unnecessary, as all inputs should be canonical, and so should be
> >    removed (fixing any breakage that results).  
> 
> What is the interpretation of a NULL path?

Some functions (but not all) check for a NULL char * pointer, which is
not a canonical path.  The inconsistency is not a good thing.

> 
> > One question: svn_path_internal_style, which converts from native
> > separators to canonical separators, doesn't seem to be used.  Should
> > the application pass all user input paths through this function?
> 
> I think so.  (should it also UTF-8 encode/decode it?)  -- justin

UTF-8 already happens in svn_cl__args_to_target_array.

-- 
Philip Martin

---------------------------------------------------------------------

RE: [RFC] Canonical Paths

Posted by Barry Scott <ba...@ntlworld.com>.
> Yeah, "." is the right answer if you ask me.

Doesn't APR have functions that determine what a string means for
the OS and file systems it's implemented on? The "." surely is
a detail hidden inside APR. As is the problem of joining to a path
etc.

What is svn's approach to naming files on file systems where the
file's name in the repos cannot be created?

And to answer my own question after adding "bar\file.txt" from FreeBSD:

> "c:\Program Files\Subversion\svn.exe" up
svn: The system cannot find the path specified.
svn: could not save file
svn: svn_io_file_open: can't open
`./.svn/tmp/text-base/bar\file.txt.svn-base'

	BArry



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Justin Erenkrantz <je...@apache.org>.
On Wed, Aug 28, 2002 at 09:12:46PM +0100, Philip Martin wrote:
> 1. The Subversion libraries only handle canonical paths, and all the
>    functions should be passed canonical paths. The only exceptions to
>    this rule are svn_path_internal_style, svn_path_canonicalize and
>    svn_path_canonicalize_nts, which obviously can accept non-canonical
>    paths.

+1.

> 2. The application is responsible for canonicalizing all user input
>    paths before passing them to Subversion.

+1.  (I wish there were a way to enforce this in code.)
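
One partial way to get there, as a minimal sketch: a predicate that
library entry points assert on in debug builds.  The names
sketch_is_canonical and sketch_path_length are hypothetical, and the
rules it encodes (non-empty, '/' separators, no empty components, no
trailing '/', "." only as the whole path) are assumptions drawn from
this proposal, not an existing Subversion check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: returns true if PATH satisfies the rules
   assumed above.  Components such as ".." are not examined; that is
   outside the scope of this sketch. */
static bool sketch_is_canonical(const char *path)
{
  const char *p;
  size_t len;

  if (path == NULL || path[0] == '\0')
    return false;               /* NULL and "" are rejected */

  len = strlen(path);
  if (len > 1 && path[len - 1] == '/')
    return false;               /* trailing separator */

  for (p = path; *p != '\0'; )
    {
      const char *end = strchr(p, '/');
      size_t n = end ? (size_t)(end - p) : strlen(p);

      if (n == 0 && p != path)
        return false;           /* empty component, as in "a//b" */
      if (n == 1 && p[0] == '.' && len > 1)
        return false;           /* "." inside a longer path */

      p += n;
      if (*p == '/')
        p++;                    /* step over the separator */
    }
  return true;
}

/* Hypothetical library entry point that guards its input.  The check
   disappears when compiled with NDEBUG. */
static size_t sketch_path_length(const char *path)
{
  assert(sketch_is_canonical(path));
  return strlen(path);
}
```

This only catches violations at run time in debug builds; it does
not give a compile-time guarantee.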

> 3. The canonical form for the path representing the current directory
>    could be either "." or "".  Currently the path library is not
>    consistent.  We should pick one, and then we do not need to handle
>    the other.  APR is known to have some problems using "", so "."
>    would be the one I would choose.

Yeah, "." is the right answer if you ask me.

> 4. Functions like svn_path_split, svn_path_remove_component, etc
>    should only return canonical paths.

And, I think that's been the biggest problem so far - they don't.
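
A minimal sketch of what "only return canonical paths" could mean for
a remove-component operation.  sketch_remove_component is a
hypothetical helper, not the real svn_path_remove_component, and it
assumes "." is the form chosen for the current directory.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper.  Assumes PATH is already canonical: non-empty,
   '/' separators, no trailing '/' except for the root "/".  Truncates
   PATH in place to its parent and keeps the result canonical. */
static void sketch_remove_component(char *path)
{
  char *slash;

  assert(path != NULL && path[0] != '\0');  /* not canonical otherwise */

  slash = strrchr(path, '/');
  if (slash == NULL)
    strcpy(path, ".");          /* "foo" becomes ".", never "" */
  else if (slash == path)
    path[1] = '\0';             /* "/foo" becomes "/"; "/" stays "/" */
  else
    *slash = '\0';              /* "a/b" becomes "a" */
}
```

The single-component case is where an empty string would otherwise
leak out, so that is the case the sketch handles explicitly.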

> 5. In some places the path library handles NULL paths.  This is
>    unnecessary, as all inputs should be canonical, and so should be
>    removed (fixing any breakage that results).  

What is the interpretation of a NULL path?

> One question: svn_path_internal_style, which converts from native
> separators to canonical separators, doesn't seem to be used.  Should
> the application pass all user input paths through this function?

I think so.  (should it also UTF-8 encode/decode it?)  -- justin

---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Marcus Comstedt <ma...@mc.pp.se>.
Dave Cridland <da...@cridland.net> writes:

> I almost hate to contribute to this discussion, given that I'm not
> entirely sure it's not a bike-shed. However, I'm just throwing ideas
> about. I will happily code the relevant support code up for this idea if
> there's the slightest interest. I don't think I'd have a clue which
> functions actually need canonical paths, though.
> 
> What about defining, in code terms, a "Canonical Path" as a structure
> (or object, in those language bindings which support them):
> 
> typedef struct {
> 	char * s;
> 	size_t l;
> } SVN_PATH_COMPONENT;
> 
> typedef struct {
> 	SVN_PATH_COMPONENT * c;
> 	size_t l;
> } SVN_PATH_CANONICAL;
> 
> const SVN_PATH_COMPONENT svn_path_cwd = { 0, 0 };
> const SVN_PATH_COMPONENT svn_path_repository_root = { 0, 1 };
> 
> Gives you type safety, such that it's impossible to pass a non-canonical
> path into a function expecting a canonical one, and allows for
> canonicalization of local paths to be based on any path separators and
> cwd semantics you want.
> 
> You do, of course, have to build some extra functions for making a "DAV
> canonical path" - URL encode each component, and glue them with '/' -
> and a "Repository Canonical Path", but discussions seem to be hinting
> that these are (potentially) different anyway.
> 
> Thoughts?


The problem with changing the representation of paths is that paths
are being passed around in a _lot_ of code, and this code would then
have to be changed (at least as far as declarations are concerned).

The point is probably moot now though, since my "UTF-8 hack" should
give us all the degrees of freedom we would like while still retaining
exactly the same representation of paths (UTF-8 encoded component
names separated by the octet 0x2f in a C-string).
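
To make that representation concrete, here is a minimal sketch that
walks such a C-string byte by byte.  sketch_count_components is a
hypothetical helper.  The byte-wise scan relies on a property of
UTF-8: every byte of a multi-byte sequence is 0x80 or above, so the
octet 0x2f can only ever be the separator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the non-empty components of PATH, where
   components are separated by the octet 0x2f. */
static size_t sketch_count_components(const char *path)
{
  size_t count = 0;
  int in_component = 0;

  assert(path != NULL);

  for (; *path != '\0'; path++)
    {
      if ((unsigned char)*path == 0x2f)
        in_component = 0;       /* separator ends the current component */
      else if (!in_component)
        {
          in_component = 1;     /* first byte of a new component */
          count++;
        }
    }
  return count;
}
```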


  // Marcus



---------------------------------------------------------------------

Re: [RFC] Canonical Paths

Posted by Dave Cridland <da...@cridland.net>.
I almost hate to contribute to this discussion, given that I'm not
entirely sure it's not a bike-shed. However, I'm just throwing ideas
about. I will happily code the relevant support code up for this idea if
there's the slightest interest. I don't think I'd have a clue which
functions actually need canonical paths, though.

What about defining, in code terms, a "Canonical Path" as a structure
(or object, in those language bindings which support them):

typedef struct {
	char * s;
	size_t l;
} SVN_PATH_COMPONENT;

typedef struct {
	SVN_PATH_COMPONENT * c;
	size_t l;
} SVN_PATH_CANONICAL;

const SVN_PATH_COMPONENT svn_path_cwd = { 0, 0 };
const SVN_PATH_COMPONENT svn_path_repository_root = { 0, 1 };

Gives you type safety, such that it's impossible to pass a non-canonical
path into a function expecting a canonical one, and allows for
canonicalization of local paths to be based on any path separators and
cwd semantics you want.
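
As a rough sketch of how a value of that type might be built, the
functions below are hypothetical (sketch_path_from_native and
sketch_path_free are illustrative names only).  The sketch assumes
relative paths, treats the current directory as a path with zero
components, and drops empty and "." components; none of that is
settled above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *s;                      /* component bytes, NUL-terminated here */
  size_t l;                     /* length of s in bytes */
} SVN_PATH_COMPONENT;

typedef struct {
  SVN_PATH_COMPONENT *c;        /* array of components */
  size_t l;                     /* number of components */
} SVN_PATH_CANONICAL;

/* Hypothetical: release everything owned by PATH. */
static void sketch_path_free(SVN_PATH_CANONICAL *path)
{
  size_t i;

  for (i = 0; i < path->l; i++)
    free(path->c[i].s);
  free(path->c);
  path->c = NULL;
  path->l = 0;
}

/* Hypothetical: split NATIVE on SEP into OUT, skipping empty and "."
   components.  Returns 0 on success, -1 on allocation failure. */
static int sketch_path_from_native(SVN_PATH_CANONICAL *out,
                                   const char *native, char sep)
{
  const char *p = native;

  assert(out != NULL && native != NULL);

  /* A string of N bytes holds at most N/2 + 1 non-empty components. */
  out->l = 0;
  out->c = malloc((strlen(native) / 2 + 1) * sizeof *out->c);
  if (out->c == NULL)
    return -1;

  while (*p != '\0')
    {
      const char *end = strchr(p, sep);
      size_t n = end ? (size_t)(end - p) : strlen(p);

      if (n > 0 && !(n == 1 && p[0] == '.'))
        {
          char *copy = malloc(n + 1);
          if (copy == NULL)
            {
              sketch_path_free(out);
              return -1;
            }
          memcpy(copy, p, n);
          copy[n] = '\0';
          out->c[out->l].s = copy;
          out->c[out->l].l = n;
          out->l++;
        }
      p += n;
      if (*p != '\0')
        p++;                    /* skip the separator */
    }
  return 0;
}
```

A function declared to take SVN_PATH_CANONICAL cannot be handed a raw
char * by accident, which is the type-safety point.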

You do, of course, have to build some extra functions for making a "DAV
canonical path" - URL encode each component, and glue them with '/' -
and a "Repository Canonical Path", but discussions seem to be hinting
that these are (potentially) different anyway.

Thoughts?

Dave.


---------------------------------------------------------------------