Posted to dev@subversion.apache.org by Garret Wilson <ga...@globalmentor.com> on 2012/01/20 19:38:28 UTC

URI-encoding on 1.7 repository?

What is the canonical way to encode filenames, both in the API and in 
the underlying FSFS in a Subversion 1.7 repository?

Let's say I have the file "a b.txt", which consists of "a" and "b" with 
a space in between. How should this be stored on the server? How should 
the various APIs give it to me?

Let me explain further. If I commit a file on Windows 7 Professional 64 
bit on an NTFS partition using TortoiseSVN, and then turn around and 
read that repository using SVNKit, the SVNDirEntry.getRelativePath() 
gives me "a b.txt". I don't know if on the back-end these files are 
being stored as "a b.txt", or if they are being stored in canonical URI 
form (i.e. "a%20b.txt") and SVNKit is just being "helpful" by decoding them.

From my end, I'm starting with 100% canonically-encoded URIs. If 
Subversion is storing these things in decoded form on the 
back end, does it compensate for characters not supported by the 
underlying file system? So when I take my URI and I decode it just so I 
can save the filename the way Subversion likes, how do I know which 
characters to decode (those supported by the underlying file system---as 
if I, the client know what that is!) but which characters to leave 
encoded (those not supported by the underlying file system on the server)?

Maybe someone can set me straight here. I'm hoping that Subversion 
stores everything as correctly UTF-8-encoded and escaped URIs in the 
back-end and in its APIs, and that the real culprit here is SVNKit for 
being "helpful" and decoding the strings for me without asking. Or I 
suppose the other option that would work almost as well is if everything 
on the back-end was stored in decoded form, but some tricks are pulled 
so that /all/ characters are supported, regardless of the underlying 
file system. The case I don't want to end up in is where I have to 
encode some characters but not others based upon some file system 
implementation I don't know about on the server.

Thanks for shedding some light on this.

Garret

Re: URI-encoding on 1.7 repository?

Posted by Garret Wilson <ga...@globalmentor.com>.
Oh, and in case I wasn't clear, I'm referring to a Subversion 
repository, not a local copy. And I'm referring to the top-most API. If 
some of the lower layers are more restrictive than the top-most API, 
then they should use some encoding scheme (what, I don't care) to shield 
this platform-specific restriction from the top-level API---which is 
what I thought Daniel was saying at first.

Garret


On 1/20/2012 7:28 PM, Garret Wilson wrote:
> On 1/20/2012 7:00 PM, Daniel Shahaf wrote:
>> Garret Wilson wrote on Fri, Jan 20, 2012 at 18:18:24 -0800:
>>> On 1/20/2012 6:14 PM, Daniel Shahaf wrote:
>>>> You don't care what FS backend the server runs. All you care is
>>>> that the endpoint of svn_ra_open4() implements the Subversion RA
>>>> API properly. Normal Subversion servers use svn_fs.h which in turn
>>>> presents the same API _regardless of which backend is used_. I'll
>>>> spell it out: the notion of 'valid pathname in a Subversion
>>>> filesystem' does not depend on the FS backend in use.
>>> All that is good news. So I guess the important question is: what
>>> spells out "the notion of 'valid pathname on a Subversion
>>> filesystem'"? Is it "any valid Unicode code point?" What I'm getting
>> See my previous reply.
>
> Right. So your previous reply said that a "valid pathname" is the same 
> on all platforms, and that the underlying implementation will take 
> care of the details. I'm asking what are the rules for a "valid 
> pathname". I'm glad that these rules are the same across all 
> platforms, but I don't know what the rules are. In other words, what 
> goes in the following function?
>
> boolean isValidSubversionPathname(String pathname);
>
>
>>
>>> at is that I need to know which characters, if any, I need to encode
>>> before passing them to Subversion. If Subversion supports any
>>> Unicode character, I can just pass the path decoded and sleep
>>> soundly at night. If not, I need to know which ones to decode and
>>> which ones to pass through.
>> Err, that depends on what API layer you're working with.  (For example:
>> svn_fs.h is perfectly happy with :,*,\n as part of the basename, but
>> libsvn_wc on windows, and the mergeinfo logic, aren't.)
>
> Oh, that's bad news. In your previous reply you said, "the notion of 
> 'valid pathname in a Subversion
> filesystem' does not depend on the FS backend in use." Now you seem to 
> say "whether some pathname is valid or not depends on whether 
> you're on Windows or some other platform." (Even worse, you seem to be 
> saying that the notion of "valid pathname" isn't even consistent 
> across the API.)
>
>> And 'what to encode/decode' is a rather vague question.  I'm not sure if
>> it means "Does `svn info uri:///foo bar` == `svn info uri:///foo%20bar`?"
>> or something else.  Can you be more concrete?
>
> It doesn't matter. It's some black box that works like this:
>
> String encode(String input);
> String decode(String output);
>
> I can come up with a thousand ways to encode/decode. I can use %hh. I 
> can use ^0xhh. The only two requirements are that 1) encode() provides 
> me with a string guaranteed to be a valid pathname, and 2) decode() 
> will take the encoded string and give me back the decoded string I 
> started with.
>
> But to meet requirement #1, I have to know which characters are 
> considered valid and which aren't. That's what I don't know, and 
> that's what I'm asking:
>
>  1. Does the API guarantee that a "valid pathname" (whatever that is)
>     is the same across all platforms? I thought you said yes, but now
>     it seems you're saying no. (If you say "no", then there's no point
>     in answering question 2, because we're stuck---I can write code
>     that may work with one repository on one platform, but suddenly
>     fail when I move the same data to another platform.)
>  2. What is the definition of "valid pathname"? Is it any Unicode
>     character? Is it only XML name characters? Is it any Unicode
>     character except control characters and NULL (\u0000)?
>
> Sorry if I'm not clear. It's a very simple question, and I hope I'm 
> not making it more complicated than it is.
>
> Think about it this way: pretend you have an XML document with the 
> element <a-b>. You walk the DOM of that document on Windows, and it 
> works fine. But when you try to process the DOM on a Mac, it breaks, with your 
> XML processor saying, "sorry, an XML name cannot have a '-' 
> character". That will never happen. Why? Because (these are analogous 
> questions to the ones above concerning Subversion):
>
>  1. The XML specification guarantees that all XML processors agree on
>     what an XML name is.
>  2. Specifically, an XML name is composed of a NameStartChar followed
>     by zero or more NameChars, as defined here:
>     http://www.w3.org/TR/REC-xml/#NT-Name
>
> Does that make sense? Can we answer those same two questions 
> concerning Subversion pathnames?
>
> Garret
>

Re: URI-encoding on 1.7 repository?

Posted by Garret Wilson <ga...@globalmentor.com>.
On 1/20/2012 7:00 PM, Daniel Shahaf wrote:
> Garret Wilson wrote on Fri, Jan 20, 2012 at 18:18:24 -0800:
>> On 1/20/2012 6:14 PM, Daniel Shahaf wrote:
>>> You don't care what FS backend the server runs. All you care is
>>> that the endpoint of svn_ra_open4() implements the Subversion RA
>>> API properly. Normal Subversion servers use svn_fs.h which in turn
>>> presents the same API _regardless of which backend is used_. I'll
>>> spell it out: the notion of 'valid pathname in a Subversion
>>> filesystem' does not depend on the FS backend in use.
>> All that is good news. So I guess the important question is: what
>> spells out "the notion of 'valid pathname on a Subversion
>> filesystem'"? Is it "any valid Unicode code point?" What I'm getting
> See my previous reply.

Right. So your previous reply said that a "valid pathname" is the same 
on all platforms, and that the underlying implementation will take care 
of the details. I'm asking what are the rules for a "valid pathname". 
I'm glad that these rules are the same across all platforms, but I don't 
know what the rules are. In other words, what goes in the following 
function?

boolean isValidSubversionPathname(String pathname);


>
>> at is that I need to know which characters, if any, I need to encode
>> before passing them to Subversion. If Subversion supports any
>> Unicode character, I can just pass the path decoded and sleep
>> soundly at night. If not, I need to know which ones to decode and
>> which ones to pass through.
> Err, that depends on what API layer you're working with.  (For example:
> svn_fs.h is perfectly happy with :,*,\n as part of the basename, but
> libsvn_wc on windows, and the mergeinfo logic, aren't.)

Oh, that's bad news. In your previous reply you said, "the notion of 
'valid pathname in a Subversion
filesystem' does not depend on the FS backend in use." Now you seem to 
say "whether some pathname is valid or not depends on whether you're 
on Windows or some other platform." (Even worse, you seem to be saying 
that the notion of "valid pathname" isn't even consistent across the API.)

> And 'what to encode/decode' is a rather vague question.  I'm not sure if
> it means "Does `svn info uri:///foo bar` == `svn info uri:///foo%20bar`?"
> or something else.  Can you be more concrete?

It doesn't matter. It's some black box that works like this:

String encode(String input);
String decode(String output);

I can come up with a thousand ways to encode/decode. I can use %hh. I 
can use ^0xhh. The only two requirements are that 1) encode() provides 
me with a string guaranteed to be a valid pathname, and 2) decode() will 
take the encoded string and give me back the decoded string I started with.
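
A minimal sketch of such a black box in Java. It assumes, purely for illustration, that the characters to shield are ':', '*' and newline (the ones named elsewhere in this thread as troublesome for some layers), plus '%' as the escape character itself. The class name PathEscaper and the RESERVED set are hypothetical; Subversion does not define either.

```java
final class PathEscaper {
    // Hypothetical reserved set: '%' must be included so that every '%' in
    // the encoded form is unambiguously the start of an escape.
    private static final String RESERVED = "%:*\n";

    // Replace each reserved character with %HH (two uppercase hex digits).
    // All reserved characters are ASCII, so two digits are always enough.
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (RESERVED.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse of encode(). Assumes the argument was produced by encode();
    // a stray or truncated '%' escape will throw a runtime exception.
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                out.append((char) Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this set, decode(encode(s)) returns s for any input string, which is requirement 2. Whether the output meets requirement 1 depends entirely on whether RESERVED matches what the target layer actually rejects.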

But to meet requirement #1, I have to know which characters are 
considered valid and which aren't. That's what I don't know, and that's 
what I'm asking:

 1. Does the API guarantee that a "valid pathname" (whatever that is) is
    the same across all platforms? I thought you said yes, but now it
    seems you're saying no. (If you say "no", then there's no point in
    answering question 2, because we're stuck---I can write code that
    may work with one repository on one platform, but suddenly fail when
    I move the same data to another platform.)
 2. What is the definition of "valid pathname"? Is it any Unicode
    character? Is it only XML name characters? Is it any Unicode
    character except control characters and NULL (\u0000)?

Sorry if I'm not clear. It's a very simple question, and I hope I'm not 
making it more complicated than it is.

Think about it this way: pretend you have an XML document with the 
element <a-b>. You walk the DOM of that document on Windows, and it 
works fine. But when you try to process the DOM on a Mac, it breaks, with your 
XML processor saying, "sorry, an XML name cannot have a '-' character". 
That will never happen. Why? Because (these are analogous questions to 
the ones above concerning Subversion):

 1. The XML specification guarantees that all XML processors agree on
    what an XML name is.
 2. Specifically, an XML name is composed of a NameStartChar followed by
    zero or more NameChars, as defined here: http://www.w3.org/TR/REC-xml/#NT-Name

Does that make sense? Can we answer those same two questions concerning 
Subversion pathnames?

Garret


Re: URI-encoding on 1.7 repository?

Posted by Daniel Shahaf <da...@elego.de>.
Garret Wilson wrote on Fri, Jan 20, 2012 at 19:27:12 -0800:
> On 1/20/2012 7:00 PM, Daniel Shahaf wrote:
> >Garret Wilson wrote on Fri, Jan 20, 2012 at 18:18:24 -0800:
> >>On 1/20/2012 6:14 PM, Daniel Shahaf wrote:
> >>>You don't care what FS backend the server runs. All you care is
> >>>that the endpoint of svn_ra_open4() implements the Subversion RA
> >>>API properly. Normal Subversion servers use svn_fs.h which in turn
> >>>presents the same API _regardless of which backend is used_. I'll
> >>>spell it out: the notion of 'valid pathname in a Subversion
> >>>filesystem' does not depend on the FS backend in use.
> >>All that is good news. So I guess the important question is: what
> >>spells out "the notion of 'valid pathname on a Subversion
> >>filesystem'"? Is it "any valid Unicode code point?" What I'm getting
> >See my previous reply.
> 
> Right. So your previous reply said that a "valid pathname" is the
> same on all platforms, and that the underlying implementation will
> take care of the details. I'm asking what are the rules for a "valid
> pathname". I'm glad that these rules are the same across all
> platforms, but I don't know what the rules are. In other words, what
> goes in the following function?
> 
> boolean isValidSubversionPathname(String pathname);
> 

http://svn.haxx.se/dev/archive-2012-01/0292.shtml -> 
https://svn.apache.org/repos/asf/subversion/tags/1.7.2/subversion/libsvn_fs/fs-loader.c#line-293
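
For readers who want the Java signature from the question filled in, here is a rough sketch. It rests on an assumption that should be checked against the linked source and the svn_fs.h documentation: that a pathname is accepted at the filesystem layer when it can be encoded as UTF-8, contains no NUL character, and has no "." or ".." component. The class name PathnameCheck is hypothetical, and this is not a port of the C function.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class PathnameCheck {
    // Assumed rules (see lead-in): UTF-8 encodable, no NUL, no "." or ".."
    // component. Empty components are not rejected here.
    static boolean isValidSubversionPathname(String pathname) {
        if (pathname.indexOf('\0') >= 0) {
            return false;
        }
        try {
            // A fresh encoder reports malformed input, so an unpaired
            // surrogate (not representable in UTF-8) raises an exception.
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(pathname));
        } catch (CharacterCodingException e) {
            return false;
        }
        for (String component : pathname.split("/", -1)) {
            if (component.equals(".") || component.equals("..")) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumed rules, "a b.txt", "a%20b.txt" and even "a:b*c" all pass, which matches the statement above that svn_fs.h accepts ':' and '*' in a basename. Other layers may still reject them.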

> 
> >
> >>at is that I need to know which characters, if any, I need to encode
> >>before passing them to Subversion. If Subversion supports any
> >>Unicode character, I can just pass the path decoded and sleep
> >>soundly at night. If not, I need to know which ones to decode and
> >>which ones to pass through.
> >Err, that depends on what API layer you're working with.  (For example:
> >svn_fs.h is perfectly happy with :,*,\n as part of the basename, but
> >libsvn_wc on windows, and the mergeinfo logic, aren't.)
> 
> Oh, that's bad news. In your previous reply you said, "the notion of
> 'valid pathname in a Subversion
> filesystem' does not depend on the FS backend in use." Now you seem
> to say "whether some pathname is valid or not depends on whether
> you're on Windows or some other platform."

No.  I said that "* is a valid pathname character in the Subversion
filesystem", and that assertion is true on Windows too.

You're welcome to grep for '#ifdef WIN32' in subversion/tests/*_fs/.

> (Even worse, you seem to be saying that the notion of "valid pathname"
> isn't even consistent across the API.)

Yes.  If a pathname contains "*" or ":", it is my understanding that
trying to set mergeinfo on that path won't work well.  (I haven't tried
to reproduce this, but it would be trivial to do so on a Unix-y system.)
See svn_mergeinfo_to_string().

> 
> >And 'what to encode/decode' is a rather vague question.  I'm not sure if
> >it means "Does `svn info uri:///foo bar` == `svn info uri:///foo%20bar`?"
> >or something else.  Can you be more concrete?
> 
> It doesn't matter. It's some black box that works like this:
> 
> String encode(String input);
> String decode(String output);
> 
> I can come up with a thousand ways to encode/decode. I can use %hh.
> I can use ^0xhh. The only two requirements are that 1) encode()
> provides me with a string guaranteed to be a valid pathname, and 2)
> decode() will take the encoded string and give me back the decoded
> string I started with.
> 
> But to meet requirement #1, I have to know which characters are
> considered valid and which aren't. That's what I don't know, and
> that's what I'm asking:
> 
> 1. Does the API guarantee that a "valid pathname" (whatever that is) is
>    the same across all platforms? I thought you said yes, but now it
>    seems you're saying no. (If you say "no", then there's no point in

The answer is "Yes" for all server-side APIs.

Re: URI-encoding on 1.7 repository?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Garret Wilson wrote on Fri, Jan 20, 2012 at 18:18:24 -0800:
> On 1/20/2012 6:14 PM, Daniel Shahaf wrote:
> >You don't care what FS backend the server runs. All you care is
> >that the endpoint of svn_ra_open4() implements the Subversion RA
> >API properly. Normal Subversion servers use svn_fs.h which in turn
> >presents the same API _regardless of which backend is used_. I'll
> >spell it out: the notion of 'valid pathname in a Subversion
> >filesystem' does not depend on the FS backend in use.
> 
> All that is good news. So I guess the important question is: what
> spells out "the notion of 'valid pathname on a Subversion
> filesystem'"? Is it "any valid Unicode code point?" What I'm getting

See my previous reply.

> at is that I need to know which characters, if any, I need to encode
> before passing them to Subversion. If Subversion supports any
> Unicode character, I can just pass the path decoded and sleep
> soundly at night. If not, I need to know which ones to decode and
> which ones to pass through.

Err, that depends on what API layer you're working with.  (For example:
svn_fs.h is perfectly happy with :,*,\n as part of the basename, but
libsvn_wc on windows, and the mergeinfo logic, aren't.)

And 'what to encode/decode' is a rather vague question.  I'm not sure if
it means "Does `svn info uri:///foo bar` == `svn info uri:///foo%20bar`?"
or something else.  Can you be more concrete?

> 
> Garret

Re: URI-encoding on 1.7 repository?

Posted by Garret Wilson <ga...@globalmentor.com>.
On 1/20/2012 6:14 PM, Daniel Shahaf wrote:
> You don't care what FS backend the server runs. All you care is that 
> the endpoint of svn_ra_open4() implements the Subversion RA API 
> properly. Normal Subversion servers use svn_fs.h which in turn 
> presents the same API _regardless of which backend is used_. I'll 
> spell it out: the notion of 'valid pathname in a Subversion 
> filesystem' does not depend on the FS backend in use. 

All that is good news. So I guess the important question is: what spells 
out "the notion of 'valid pathname on a Subversion filesystem'"? Is it 
"any valid Unicode code point?" What I'm getting at is that I need to 
know which characters, if any, I need to encode before passing them to 
Subversion. If Subversion supports any Unicode character, I can just 
pass the path decoded and sleep soundly at night. If not, I need to know 
which ones to decode and which ones to pass through.

Garret

Re: URI-encoding on 1.7 repository?

Posted by Daniel Shahaf <da...@elego.de>.
Garret Wilson wrote on Fri, Jan 20, 2012 at 10:38:28 -0800:
> From my end I'm actually starting with 100% canonically-encoded URIs
> to begin with. If Subversion is storing these things in decoded form
> on the back end, does it compensate for characters not supported by
> the underlying file system? So when I take my URI and I decode it
> just so I can save the filename the way Subversion likes, how do I
> know which characters to decode (those supported by the underlying
> file system---as if I, the client know what that is!) but which
> characters to leave encoded (those not supported by the underlying
> file system on the server)?

You don't care what FS backend the server runs.  All you care is that
the endpoint of svn_ra_open4() implements the Subversion RA API
properly.  Normal Subversion servers use svn_fs.h which in turn presents
the same API _regardless of which backend is used_.

I'll spell it out: the notion of 'valid pathname in a Subversion
filesystem' does not depend on the FS backend in use.

Re: URI-encoding on 1.7 repository?

Posted by Daniel Shahaf <da...@elego.de>.
Garret Wilson wrote on Fri, Jan 20, 2012 at 10:38:28 -0800:
> Let's say I have the file "a b.txt", which consists of "a" and "b"
> with a space in between. How should this be stored on the server?

Implementation detail.  The API promises that a Subversion filesystem can
have two sibling dirents named "a b" and "a%20b".  See svn_fs__path_valid()'s
docstring for a pointer to the public API docs.

> How should the various APIs give it to me?

%-encoded when part of a URI, and unencoded otherwise.
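
To make the distinction concrete, here is a small Java sketch that uses java.net.URI as a stand-in for whatever encoder a client library applies. It is an illustration of the rule, not Subversion's or SVNKit's own code, and the class and method names are hypothetical. It shows why the siblings "a b" and "a%20b" stay distinct: a literal '%' in a filesystem name is itself escaped when the name becomes part of a URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriPathDemo {
    // Filesystem-style path (unencoded) to the form used inside a URI.
    // The multi-argument URI constructor quotes illegal characters and
    // always quotes '%'. The path is expected to start with '/'.
    static String toUriPath(String fsPath) {
        try {
            return new URI(null, null, fsPath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // URI path (percent-encoded) back to the unencoded form.
    static String fromUriPath(String rawPath) {
        return URI.create(rawPath).getPath();
    }
}
```

Here toUriPath("/a b.txt") gives "/a%20b.txt", while toUriPath("/a%20b.txt") gives "/a%2520b.txt", so the two names never collide in URI form. fromUriPath reverses each of them.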