Posted to dev@subversion.apache.org by Karl Fogel <kf...@galois.collab.net> on 2001/02/15 17:45:39 UTC

Re: parsing paths

Jim Blandy <ji...@zwingli.cygnus.com> writes:
> I'm not saying the libsvn_subr function should be changed to behave
> this way.  But I know people will tell me I'm duplicating code, so I'm
> just explaining the behavior I require.  If libsvn_subr wants to
> provide it, groovy for me.  I don't feel that I know its clientele
> well enough to make changes there myself.

As far as I know, the other users of svn_path_* also depend on N
consecutive slashes being equivalent to a single slash, and also
ignore trailing slashes.  Ben, you have more recent experience with
those callers, so please correct the above if necessary.

If it's all true, we can change the path library to do what you need,
Jim, and you can get rid of the duplicated code.

Below is Jim's comment from svn_fs.h:

/* Here are the rules for directory entry names, and directory paths:

   A directory entry name is a Unicode string encoded in UTF-8, and
   may not contain the null character (U+0000).  The name should be in
   Unicode canonical decomposition and ordering.  No directory entry
   may be named '.', '..', or the empty string.  Given a directory
   entry name which fails to meet these requirements, a filesystem
   function returns an SVN_ERR_FS_PATH_SYNTAX error.

   A directory path is a sequence of one or more directory entry
   names, separated by slash characters (U+002f).  Sequences of two or
   more consecutive slash characters are treated as if they were a
   single slash.  If a path ends with a slash, it refers to the same
   node it would without the slash, but that node must be a directory,
   or else the function returns an SVN_ERR_FS_NOT_DIRECTORY error.

   Paths may not start with a slash.  All directory paths in
   Subversion are relative; all functions that expect a path as an
   argument also expect a directory the path should be interpreted
   relative to.  If a function receives a path that begins with a
   slash, it will return an SVN_ERR_FS_PATH_SYNTAX error.  */
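
To make the rules in that comment concrete, here is a minimal sketch of
a scanner that pulls one entry name at a time off a path.  It is not
the next_entry_name function from libsvn_fs/tree.c; the name
next_entry, the enum entry_status, and its values are hypothetical, and
real code would presumably return svn_error_t values such as
SVN_ERR_FS_PATH_SYNTAX instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, standing in for real Subversion errors. */
enum entry_status { ENTRY_OK, ENTRY_END, ENTRY_SYNTAX_ERROR };

/* Scan one directory entry name starting at *PATH.

   On ENTRY_OK, *NAME and *LEN describe the name (it is NOT
   null-terminated), *FOLLOWED_BY_SLASH is nonzero if at least one
   slash came after the name, and *PATH is advanced past the name and
   the whole run of slashes that follows it.

   ENTRY_END means the path is exhausted.  ENTRY_SYNTAX_ERROR covers a
   leading slash and the forbidden names "." and "..". */
static enum entry_status
next_entry(const char **path, const char **name, size_t *len,
           int *followed_by_slash)
{
  const char *start, *end;

  assert(path != NULL && *path != NULL);
  start = *path;

  if (*start == '\0')
    return ENTRY_END;             /* nothing left to parse */
  if (*start == '/')
    return ENTRY_SYNTAX_ERROR;    /* paths may not start with a slash */

  end = start;
  while (*end != '\0' && *end != '/')
    end++;

  *name = start;
  *len = (size_t)(end - start);

  if ((*len == 1 && start[0] == '.')
      || (*len == 2 && start[0] == '.' && start[1] == '.'))
    return ENTRY_SYNTAX_ERROR;    /* "." and ".." are not entry names */

  *followed_by_slash = (*end == '/');
  while (*end == '/')
    end++;                        /* N consecutive slashes act as one */

  *path = end;
  return ENTRY_OK;
}
```

A caller would loop until ENTRY_END.  If the last name returned had
*followed_by_slash set, the path ended in a slash, so the node it names
must be a directory (the SVN_ERR_FS_NOT_DIRECTORY case above).  Because
each call consumes the trailing run of slashes, an empty entry name can
never be produced.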

-K

Jim Blandy <ji...@zwingli.cygnus.com> writes:
> I've just committed some path-traversal code to libsvn_fs/tree.c which
> includes a function next_entry_name for parsing paths.
> 
> I know there are existing functions for dealing with paths in
> libsvn_subr, but they don't work the way I want.  In particular:
> 
> - They leave it to the caller to interpret series of two or more
>   consecutive slashes.  The filesystem interface defines how they must
>   be interpreted (see svn_fs.h, "Directory entry names and directory
>   paths"), requiring that they be handled as in POSIX: multiple
>   slashes are equivalent to one slash.
> 
> - They don't provide any convenient way to handle trailing slashes.
>   The filesystem also defines what this means, consistent with the
>   usage in the GNU fileutils (and, I think, POSIX).  next_entry_name
>   does something that is easy to deal with.
> 
> I'm not saying the libsvn_subr function should be changed to behave
> this way.  But I know people will tell me I'm duplicating code, so I'm
> just explaining the behavior I require.  If libsvn_subr wants to
> provide it, groovy for me.  I don't feel that I know its clientele
> well enough to make changes there myself.

Re: Unicode directory entries

Posted by Jim Blandy <ji...@zwingli.cygnus.com>.
Another problem with that text (defining how Unicode should be used in
filenames) is that it places the onus on the caller to put the name in
the right form.  The filesystem doesn't actually check the form.  As a
consequence, if somebody does it the wrong way, you get a mess.  The
comment places the blame outside the filesystem, but that doesn't help
the poor user.

So the filesystem should either check that the filenames are properly
decomposed, or normalize them itself.  The latter would be the easiest
to use, but more work.

Surely there are libraries for this.  But I don't know where they are.

Re: Unicode directory entries

Posted by Jim Blandy <ji...@zwingli.cygnus.com>.
> > > * What do you mean by ordering? It didn't sound like you were
> > > talking about a sorting order...
> 
> > No --- I was trying to refer to the ordering of the modifiers.  It
> > sounds like that is subsumed by the normalization form requirements
> > you mention above.
> 
> > What I'm trying to do is put directory entries in some canonical form,
> > so that directory entries don't become mysteriously invisible because
> > different users chose different compositions/decompositions.  What
> > would you recommend that I say?
> 
> Well, since XML wants Form C, it seems to make sense for us to use
> Form C as well.

Form C it is.  I'll read TR15.  Thanks very much for catching this.

RE: Unicode directory entries

Posted by Bill Tutt <ra...@lyra.org>.
From: Jim Blandy [mailto:jimb@zwingli.cygnus.com] 
> "Bill Tutt" <ra...@lyra.org> writes:
> > Several comments/questions:
> > * By Unicode canonical decomposition, do you mean Normalization Form D
> > as noted in TR15? (http://www.unicode.org/unicode/reports/tr15/)
> 
> > I ask because canonical decomposition results in all combined composite
> > characters being expanded into their component forms, e.g. a composite
> > umlauted lowercase u turns into two characters: a lowercase u followed
> > by a combining umlaut. I ask because you really wouldn't want to
> > implement the wrong normalization algorithm. :) TR15 also states the
> > following:
> > 
> > "The W3C Character Model for the World Wide Web [CharMod] requires the
> > use of Normalization Form C for XML and related standards (this document
> > is not yet final, but this requirement is not expected to change). See
> > the W3C Requirements for String Identity, Matching, and String Indexing
> > [CharReq] for more background."

> I was punting.  I knew that there were several ways to represent
> composite characters, and assumed that there was some form recommended
> for use in names that needed to be matched.  From what you say, it
> sounds like there are several.  (Joy.)

Well, there are 4 different ways to handle composite characters.
Form D: Canonical Decomposition.
Form C: Canonical Decomposition followed by Canonical Composition.
Form KD: Compatibility Decomposition.
Form KC: Compatibility Decomposition followed by Canonical Composition.
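
To illustrate why the choice matters for directory lookups, here is a
small sketch showing the same visible name, a lowercase u with umlaut,
in its Form C and Form D encodings.  The identifiers u_umlaut_nfc,
u_umlaut_nfd, and names_equal_bytewise are made up for this example,
and the byte-wise comparison is only an assumption about how a lookup
that does no normalization might behave.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Precomposed form: U+00FC, UTF-8 bytes C3 BC.  This is Form C. */
static const char u_umlaut_nfc[] = "\xC3\xBC";

/* Decomposed form: U+0075 'u' followed by U+0308 COMBINING DIAERESIS,
   UTF-8 bytes 75 CC 88.  This is Form D. */
static const char u_umlaut_nfd[] = "u\xCC\x88";

/* Compare two entry names byte for byte, with no normalization.
   Returns nonzero when the names match. */
static int
names_equal_bytewise(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);
  return strcmp(a, b) == 0;
}
```

Both strings render identically, but names_equal_bytewise reports them
as different, so an entry created under one form would not be found by
a user who typed the other.  Agreeing on a single form for stored names
avoids that.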


> > * What do you mean by ordering? It didn't sound like you were talking
> > about a sorting order...

> No --- I was trying to refer to the ordering of the modifiers.  It
> sounds like that is subsumed by the normalization form requirements
> you mention above.

> What I'm trying to do is put directory entries in some canonical form,
> so that directory entries don't become mysteriously invisible because
> different users chose different compositions/decompositions.  What
> would you recommend that I say?

Well, since XML wants Form C, it seems to make sense for us to use
Form C as well.

> Another problem with that text (defining how Unicode should be used in
> filenames) is that it places the onus on the caller to put the name in
> the right form.  The filesystem doesn't actually check the form.  As a
> consequence, if somebody does it the wrong way, you get a mess.  The
> comment places the blame outside the filesystem, but that doesn't help
> the poor user.

> So the filesystem should either check that the filenames are properly
> decomposed, or normalize them itself.  The latter would be the easiest
> to use, but more work.

> Surely there are libraries for this.  But I don't know where they are.

WRT what to do about the problem:
There are two immediate sources of code that I know of to help deal
with the problem:

http://www.unicode.org/unicode/reports/tr15/
One is the technical report itself. It provides some
sample/non-optimal code for how to do the appropriate logic, as well
as some possible optimization hints. (esp. if you only want to verify
that it complies with the normalization form, as opposed to actually
normalizing the string.)
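
As a rough idea of what a verify-only fast path could look like (this
is not the code from the technical report): a name made up entirely of
ASCII bytes is unchanged by all four normalization forms, so it can be
accepted without consulting any Unicode tables.  The names norm_check
and quick_check_ascii below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: either the name is known to be normalized,
   or a full table-driven check is still required. */
enum norm_check { NORM_YES, NORM_NEEDS_FULL_CHECK };

/* Cheap pre-check on a UTF-8 entry name of LEN bytes.  Pure ASCII
   input is already in every normalization form.  Anything with a byte
   >= 0x80 is NOT declared invalid here; it just needs the real check. */
static enum norm_check
quick_check_ascii(const char *name, size_t len)
{
  size_t i;

  assert(name != NULL || len == 0);
  for (i = 0; i < len; i++)
    if ((unsigned char)name[i] >= 0x80)
      return NORM_NEEDS_FULL_CHECK;
  return NORM_YES;
}
```

Only names that fall through to NORM_NEEDS_FULL_CHECK would pay for the
full check against the character database.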

http://oss.software.ibm.com/developerworks/opensource/icu/project/
The second is IBM's ICU project, but IIRC
this has a fairly funky license.

An ancillary source of Unicode code (at least in terms of having a
nice small copy of the Unicode character database) is how Python goes
about it. Greg or I can dig up the appropriate part of that code if it 
becomes necessary.

Bill

Re: Unicode directory entries

Posted by Jim Blandy <ji...@zwingli.cygnus.com>.
"Bill Tutt" <ra...@lyra.org> writes:
> Several comments/questions:
> * By Unicode canonical decomposition, do you mean Normalization Form D
> as noted in TR15? (http://www.unicode.org/unicode/reports/tr15/) 
> 
> I ask because canonical decomposition results in all combined composite
> characters being expanded into their component forms, e.g. a composite
> umlauted lowercase u turns into two characters: a lowercase u followed
> by a combining umlaut. I ask because you really wouldn't want to
> implement the wrong normalization algorithm. :) TR15 also states the
> following:
> 
> "The W3C Character Model for the World Wide Web [CharMod] requires the
> use of Normalization Form C for XML and related standards (this document
> is not yet final, but this requirement is not expected to change). See
> the W3C Requirements for String Identity, Matching, and String Indexing
> [CharReq] for more background."

I was punting.  I knew that there were several ways to represent
composite characters, and assumed that there was some form recommended
for use in names that needed to be matched.  From what you say, it
sounds like there are several.  (Joy.)

> * What do you mean by ordering? It didn't sound like you were talking
> about a sorting order...

No --- I was trying to refer to the ordering of the modifiers.  It
sounds like that is subsumed by the normalization form requirements
you mention above.

What I'm trying to do is put directory entries in some canonical form,
so that directory entries don't become mysteriously invisible because
different users chose different compositions/decompositions.  What
would you recommend that I say?

re: Unicode directory entries

Posted by Bill Tutt <ra...@lyra.org>.
Karl quoted svn_fs.h:
> Below is Jim's comment from svn_fs.h:

[snippet]
The [directory] name should be in Unicode canonical decomposition and
ordering.  No directory entry may be named '.', '..', or the empty
string.  Given a directory entry name which fails to meet these
requirements, a filesystem function returns an SVN_ERR_FS_PATH_SYNTAX
error.
[end snippet]

Several comments/questions:
* By Unicode canonical decomposition, do you mean Normalization Form D
as noted in TR15? (http://www.unicode.org/unicode/reports/tr15/) 

I ask because canonical decomposition results in all combined composite
characters being expanded into their component forms, e.g. a composite
umlauted lowercase u turns into two characters: a lowercase u followed
by a combining umlaut. I ask because you really wouldn't want to
implement the wrong normalization algorithm. :) TR15 also states the
following:

"The W3C Character Model for the World Wide Web [CharMod] requires the
use of Normalization Form C for XML and related standards (this document
is not yet final, but this requirement is not expected to change). See
the W3C Requirements for String Identity, Matching, and String Indexing
[CharReq] for more background."

* What do you mean by ordering? It didn't sound like you were talking
about a sorting order...

Thanks,
Bill