Posted to users@subversion.apache.org by "B. Blodau" <b_...@hamburg.de> on 2008/02/12 09:52:19 UTC

UTF-8 support for Unix with APR?

I wrote a similar request on this topic earlier, but it has now become
a more general issue.

My questions are:
- Can subversion support utf8 filenames on Unix systems when using  
the apr libraries?
- Has anybody used the C-Libraries on a unix system (including MacOS  
X) and successfully used international pathnames?

I'm writing a C++ application which uses the svn libraries and
therefore inherits the apr libraries too.
When working with international filenames I'm getting errors that
characters could not be converted from utf8 to the local encoding.

Since my app is a Unicode safe application I don't want the filenames
to be converted, because I want to keep the whole Unicode character set.
Even a successful conversion to the encoding of the current user
locale would result in a limited character set.

When debugging this a bit further I came to the following code snippet in
".../apr/file_io/unix/filepath.c":

APR_DECLARE(apr_status_t) apr_filepath_encoding(int *style, apr_pool_t *p)
{
     *style = APR_FILEPATH_ENCODING_LOCALE;
     return APR_SUCCESS;
}

This looks as if - at least for Unix - no utf8 support is intended.  
Otherwise this function should return APR_FILEPATH_ENCODING_UTF8.

Can anybody confirm my concern? I just don't want to search for a  
solution where I don't have any chance.

Thanks a lot for your attention.
Bert

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: UTF-8 support for Unix with APR?

Posted by Erik Huelsmann <eh...@gmail.com>.
On 2/12/08, Erik Huelsmann <eh...@gmail.com> wrote:
> On 2/12/08, B. Blodau <b_...@hamburg.de> wrote:
> > I wrote a similar request for this topic earlier, but now it becomes
> > a more general issue.
> >
> > My questions are:
> > - Can subversion support utf8 filenames on Unix systems when using
> > the apr libraries?
> > - Has anybody used the C-Libraries on a unix system (including MacOS
> > X) and successfully used international pathnames?
> >
> > I'm writing a C++ application which uses the svn libraries and
> > therefore inherits the apr libraries too.
> > When working with international filenames I'm getting errors that
> > characters could not be converted from utf8 to the local encoding.
> >
> > Since my app is a Unicode safe application I don't want the filenames
> > to be converted, because I want to keep the whole Unicode character set.
> > Even a successful conversion to the encoding of the current user
> > locale would result in a limited character set.
> >
> > When debugging this a bit further I came to the following code snippet in
> > ".../apr/file_io/unix/filepath.c":
> >
> > APR_DECLARE(apr_status_t) apr_filepath_encoding(int *style, apr_pool_t *p)
> > {
> >     *style = APR_FILEPATH_ENCODING_LOCALE;
> >     return APR_SUCCESS;
> > }
> >
> > This looks as if - at least for Unix - no utf8 support is intended.
> > Otherwise this function should return APR_FILEPATH_ENCODING_UTF8.
>
> The APR libraries handle file paths in the system locale. This means
> they *may* be encoded in UTF-8, but are not necessarily. Whether they are
> interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> host environment.
>
> LANG=en_US.UTF-8
>
> will indicate UTF-8 pathnames. OTOH,
>
> LANG=en_US.iso8859-1
>
> will indicate "latin1" pathnames. The application is free to do with
> that information whatever it wants. Subversion uses the returned value
> to determine whether there's any "locale"->"UTF8" conversion
> necessary, since internally it entirely uses UTF8 encoded pathnames.
>
> > Can anybody confirm my concern? I just don't want to search for a
> > solution where I don't have any chance.
>
> If you encounter conversion problems, chances are you didn't provide
> any information to your application regarding the locale settings to
> be used. Only when you do that will Subversion know which source
> encoding to use (and it won't convert if the source is already
> considered UTF8). Did you call setlocale() (the C library function)?
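
A minimal sketch of what that setlocale() call could look like in an
application, assuming a POSIX system. The helper names codeset_is_utf8 and
environment_locale_is_utf8 are made up for illustration and are not part of
APR or Subversion; setlocale() and nl_langinfo(CODESET) are the standard
C/POSIX calls.

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Return 1 if a codeset name denotes UTF-8 ("UTF-8", "utf8", ...), else 0.
 * Hypothetical helper: compares case-insensitively and ignores '-'. */
int codeset_is_utf8(const char *codeset)
{
    const char *want = "utf8";
    for (; *codeset != '\0'; codeset++) {
        if (*codeset == '-')
            continue;
        if (*want == '\0' || tolower((unsigned char)*codeset) != *want)
            return 0;
        want++;
    }
    return *want == '\0';
}

/* Adopt the locale named by the environment (LANG, LC_CTYPE, LC_ALL) and
 * report whether its character set is UTF-8. */
int environment_locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;   /* the requested locale is not available */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Calling environment_locale_is_utf8() once at startup, before any paths are
handed to the svn libraries, would show whether a setting such as
LANG=en_US.UTF-8 is actually in effect for the process.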

BTW, APR supports the UTF-8 encoding of filepaths as system standard
operation on MacOSX starting 0.9.15 (if you're on the 0.9 branch) or -
I believe - the first 1.2 release after August 13th (2007).

HTH,

Erik.


Re: UTF-8 support for Unix with APR?

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2008-02-15 13:27:31 +0100, Erik Huelsmann wrote:
> Right. But all of that means it's not so deterministic that either
> APR or Subversion can solve *all* locale problems.

Perhaps they can't solve them on their own, but they should give the
user the possibility to solve them. For instance, there could be an
option (to be put in .subversion/config and/or a command-line switch)
to say: assume the current locale for input (command-line arguments
and terminal) and output (terminal), and the configured locale for
everything else.

BTW, there's a similar problem to generate ChangeLog files, which
should be in UTF-8 for most projects. What if a user wants to generate
such a file from an ISO-8859-1 terminal?

There are tools that have an option to specify the encoding of the
output (e.g. xmllint and its --encode option). Why not Subversion
for "svn log"?

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)


Re: UTF-8 support for Unix with APR?

Posted by Erik Huelsmann <eh...@gmail.com>.
On Fri, Feb 15, 2008 at 1:12 PM, Vincent Lefevre <vi...@vinc17.org> wrote:
> On 2008-02-13 16:14:28 +0100, Erik Huelsmann wrote:
>  > Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
>  > on a sanely configured environment (locale on disk == locale in
>  > terminal, locale configured in the first place, etc).
>
>  You're assuming too much. Unix is designed so that the user can use
>  different locales, e.g. in different terminals. The locales are not
>  global to the system (unlike the host name) or even the network (for
>  the NFS users). And some applications are designed to do charset
>  conversion because of that ("screen" is a good example). Moreover
>  different users will typically have different locales, and possibly
>  need to access to the data of other users. Also think about a USB
>  key that will be used in various environments...
>
>
>  > The effect is that Subversion doesn't recognize 2 filenames being the
>  > same when in fact they are differently encoded. This issue has long
>  > gone undetected, because many OSes seem to prefer either one or the
>  > other encoding (Windows and Linux prefer NFC,
>
>  I don't know about Windows, but Linux does *not* prefer NFC. It will
>  accept whatever the user will use. This can be both NFC and NFD (so
>  that the user may end up with two files with the same apparent name,
>  in particular after scp between Linux and Mac OS X), broken UTF-8
>  sequences or other encodings. Fortunately some applications (e.g.
>  GNOME ones) enforce some conventions by default.
>
>
>  > Solaris I don't know, but Mac prefers NFD).
>
>  in fact HFS+.
>
>
>  On 2008-02-13 21:56:10 +0100, Erik Huelsmann wrote:
>  > Ah! but the Mac (although that was snipped out of the quote) was
>  > exempt from 'Normal unix behaviour', since they use UTF-8 on disk *all
>  > the time*. The rest of the unix world uses LC_CTYPE, LC_ALL or LANG
>  > environment variables to determine what the current locale is. It then
>  > applies that setting both to paths on the disk as well as any output
>  > sent to the terminal.
>
>  No, it doesn't apply to pathnames. The encoding is left unspecified, and
>  may depend on the file system, and the system just see filenames as a
>  sequence of bytes (BTW, many system scripts set the locale back to C,
>  but they must work with filenames containing non-ASCII characters).
>
>  It would be more correct to say that most software doesn't support
>  filenames with non-ASCII characters. A real support would mean charset
>  conversion between the encoding on disk and the current locale.
>
>
>  > > This is the locale I know about. "LANG=en_US.UTF-8" and so forth.
>  >
>  > But, as stated above, in the rest of the unix world, LANG= also
>  > applies to paths read from disk.
>
>  No, see GNOME applications, for instance. This is mainly a question
>  of convention.
>
>  Also, at my previous lab, the NFS system has been changed to a NAS that
>  supports both Unix and Windows, and for this reason, the filenames had
>  to be interpreted as sequences of characters. Now, how the system could
>  guess the locale used by each user? You see, having an encoding based
>  on the current locale is broken by design. FYI, all the users who chose
>  a UTF-8 incompatible encoding had their filenames munged.
>
>
>  > > Is that when I first checked out a working copy? when I first made
>  > > a repository? when I first installed Subversion? when I first
>  > > installed the OS?
>  >
>  > When you installed your windows (presumably), or when you last created
>  > your Unix user.
>   ^^^^^^^^^^^^^^
>  I suppose you meant OS installation. Unix is a multi-user system!
>
>  Now, do you want every user of some USB key to have installed their
>  machine in the same way? That's incredible!
>
>
>  > And that's correct. With the right choice of pathnames the sequence of
>  > commands below could be broken (the second command will return a
>  > "Non-conforming UTF-8 sequence encountered." error):
>  >
>  > $ LANG=en_US.iso88591 svn checkout URL your-path
>  > $ LANG=en_US.UTF-8 svn update your-path
>  >
>  > Now, Subversion could remember that the path was checked out using the
>  > latin1 setting, but essentially you're telling it you changed your
>  > paths (and output) to UTF-8. Should it ignore that? Absolutely not!
>  > You might be (*should* be) right, in which case you'd end up with the
>  > wrong UTF-8, when it's being read as if it were the latin1 which you
>  > checked out...
>
>  Well, there should be a (possibly optional) way to say: use this
>  encoding for pathnames *on disk*, and use this other encoding for
>  input/output.
>
>  In a similar way, when I read/write a file with my text editor, it
>  shouldn't expect it to be always in the charset specified by the
>  current locale.
>
>  BTW, the notion of locale is old and was created when users usually
>  worked in a single environment and didn't exchange data very much.
>  Things have evolved. Nowadays, most software is able to work with
>  various charsets (sometimes recording the charset together with the
>  contents, e.g. in XML, mail messages...), instead of sticking to the
>  current locale.


Right. But all of that means it's not so deterministic that either APR
or Subversion can solve *all* locale problems.

bye,

Erik.


Re: UTF-8 support for Unix with APR?

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2008-02-13 16:14:28 +0100, Erik Huelsmann wrote:
> Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> on a sanely configured environment (locale on disk == locale in
> terminal, locale configured in the first place, etc).

You're assuming too much. Unix is designed so that the user can use
different locales, e.g. in different terminals. The locales are not
global to the system (unlike the host name) or even the network (for
the NFS users). And some applications are designed to do charset
conversion because of that ("screen" is a good example). Moreover
different users will typically have different locales, and possibly
need access to the data of other users. Also think about a USB
key that will be used in various environments...

> The effect is that Subversion doesn't recognize 2 filenames being the
> same when in fact they are differently encoded. This issue has long
> gone undetected, because many OSes seem to prefer either one or the
> other encoding (Windows and Linux prefer NFC,

I don't know about Windows, but Linux does *not* prefer NFC. It will
accept whatever the user will use. This can be both NFC and NFD (so
that the user may end up with two files with the same apparent name,
in particular after scp between Linux and Mac OS X), broken UTF-8
sequences or other encodings. Fortunately some applications (e.g.
GNOME ones) enforce some conventions by default.

> Solaris I don't know, but Mac prefers NFD).

in fact HFS+.

On 2008-02-13 21:56:10 +0100, Erik Huelsmann wrote:
> Ah! but the Mac (although that was snipped out of the quote) was
> exempt from 'Normal unix behaviour', since they use UTF-8 on disk *all
> the time*. The rest of the unix world uses LC_CTYPE, LC_ALL or LANG
> environment variables to determine what the current locale is. It then
> applies that setting both to paths on the disk as well as any output
> sent to the terminal.

No, it doesn't apply to pathnames. The encoding is left unspecified, and
may depend on the file system, and the system just sees filenames as a
sequence of bytes (BTW, many system scripts set the locale back to C,
but they must work with filenames containing non-ASCII characters).

It would be more correct to say that most software doesn't support
filenames with non-ASCII characters. Real support would mean charset
conversion between the encoding on disk and the current locale.

> > This is the locale I know about. "LANG=en_US.UTF-8" and so forth.
> 
> But, as stated above, in the rest of the unix world, LANG= also
> applies to paths read from disk.

No, see GNOME applications, for instance. This is mainly a question
of convention.

Also, at my previous lab, the NFS system was changed to a NAS that
supports both Unix and Windows, and for this reason, the filenames had
to be interpreted as sequences of characters. Now, how could the system
guess the locale used by each user? You see, having an encoding based
on the current locale is broken by design. FYI, all the users who chose
a UTF-8 incompatible encoding had their filenames munged.

> > Is that when I first checked out a working copy? when I first made
> > a repository? when I first installed Subversion? when I first
> > installed the OS?
> 
> When you installed your windows (presumably), or when you last created
> your Unix user.
  ^^^^^^^^^^^^^^
I suppose you meant OS installation. Unix is a multi-user system!

Now, do you want every user of some USB key to have installed their
machine in the same way? That's incredible!

> And that's correct. With the right choice of pathnames the sequence of
> commands below could be broken (the second command will return a
> "Non-conforming UTF-8 sequence encountered." error):
> 
> $ LANG=en_US.iso88591 svn checkout URL your-path
> $ LANG=en_US.UTF-8 svn update your-path
> 
> Now, Subversion could remember that the path was checked out using the
> latin1 setting, but essentially you're telling it you changed your
> paths (and output) to UTF-8. Should it ignore that? Absolutely not!
> You might be (*should* be) right, in which case you'd end up with the
> wrong UTF-8, when it's being read as if it were the latin1 which you
> checked out...

Well, there should be a (possibly optional) way to say: use this
encoding for pathnames *on disk*, and use this other encoding for
input/output.
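
To make the idea of separate on-disk and input/output encodings concrete,
here is a small sketch of the conversion step for the ISO-8859-1 to UTF-8
case. It is hand-rolled for illustration only: latin1_to_utf8 is a
hypothetical name, and this is not how Subversion or APR perform their
conversions.

```c
#include <stddef.h>

/* Convert len bytes of ISO-8859-1 to UTF-8.
 * out must have room for 2 * len + 1 bytes. Returns the number of bytes
 * written, not counting the terminating NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[o++] = b;                                   /* ASCII maps to itself */
        } else {
            out[o++] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte */
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation byte */
        }
    }
    out[o] = '\0';
    return o;
}
```

This direction works for every input byte because each ISO-8859-1 byte value
equals its Unicode code point. The reverse direction can fail, since most
Unicode characters have no ISO-8859-1 equivalent.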

In a similar way, when I read/write a file with my text editor, it
shouldn't expect it to be always in the charset specified by the
current locale.

BTW, the notion of locale is old and was created when users usually
worked in a single environment and didn't exchange data very much.
Things have evolved. Nowadays, most software is able to work with
various charsets (sometimes recording the charset together with the
contents, e.g. in XML, mail messages...), instead of sticking to the
current locale.

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)


Re: UTF-8 support for Unix with APR?

Posted by Erik Huelsmann <eh...@gmail.com>.
On Feb 13, 2008 9:25 PM, Ryan Schmidt <su...@ryandesign.com> wrote:
>
> On Feb 13, 2008, at 09:14, Erik Huelsmann wrote:
>
> >> SVN doesn't get it right either since it's ignorant of unicode
> >> normalization forms [1].
> >
> > Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> > on a sanely configured environment (locale on disk == locale in
> > terminal, locale configured in the first place, etc). This is fine,
> > since Subversion needs to operate across different configurations and
> > even OSes (whereas /bin/ls does not).
>
> Hold up for a second... I'm havin' a little trouble...

Ok. Lemme explain.

> > locale on disk
>
> What is this? I know that on my Mac, ...

Ah! but the Mac (although that was snipped out of the quote) was
exempt from 'Normal unix behaviour', since they use UTF-8 on disk *all
the time*. The rest of the unix world uses LC_CTYPE, LC_ALL or LANG
environment variables to determine what the current locale is. It then
applies that setting both to paths on the disk as well as any output
sent to the terminal.

>
> > == locale in terminal,
>
> This is the locale I know about. "LANG=en_US.UTF-8" and so forth.

But, as stated above, in the rest of the unix world, LANG= also
applies to paths read from disk. The Mac situation seems more sane,
but unfortunately isn't widespread...

> > locale configured in the first place
>
> What is this? What is "in the first place"?

The fact that you actually *have* a LANG= setting. If you don't,
you're restricted to using ASCII characters (because you're restricted
to the default "C" locale which only supports ascii characters).

> Is that when I first
> checked out a working copy? when I first made a repository? when I
> first installed Subversion? when I first installed the OS?

When you installed your windows (presumably), or when you last created
your Unix user. On the Mac, it's a system convention, so no need to
configure what locale to expect from the disk. For the rest of the
world, I have no idea when or how locale settings may be influenced.

> It sounded like Vincent was saying that if a working copy is created
> under one terminal locale setting, but then accessed with a different
> terminal locale setting, things don't work right.

And that's correct. With the right choice of pathnames the sequence of
commands below could be broken (the second command will return a
"Non-conforming UTF-8 sequence encountered." error):

$ LANG=en_US.iso88591 svn checkout URL your-path
$ LANG=en_US.UTF-8 svn update your-path

Now, Subversion could remember that the path was checked out using the
latin1 setting, but essentially you're telling it you changed your
paths (and output) to UTF-8. Should it ignore that? Absolutely not!
You might be (*should* be) right, in which case you'd end up with the
wrong UTF-8, when it's being read as if it were the latin1 which you
checked out...
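
Why a latin1 path trips a UTF-8 reader can be shown with a small validity
check. This is only a sketch: is_valid_utf8 is a hypothetical helper, not the
routine Subversion uses to produce the error quoted above.

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, else 0.
 * Rejects overlong forms, surrogates and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;        /* number of continuation bytes expected */
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed for this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1)
            return 0;        /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;    /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

A name such as "café" stored as latin1 ends in the single byte 0xE9. Read as
UTF-8, that byte announces a three-byte sequence that never arrives, so the
check fails. The same name stored as UTF-8 ends in 0xC3 0xA9 and passes.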


HTH,

Erik.


Re: UTF-8 support for Unix with APR?

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Feb 13, 2008, at 09:14, Erik Huelsmann wrote:

>>>> This is broken. APR should switch to UTF-8 locales internally when it
>>>> deals with filenames (like what GNOME apps do). Otherwise this leads
>>>> to consistency problems when the user has both ISO-8859-1 and UTF-8
>>>> terminal sessions (the reason is that some applications and/or some
>>>> machines do not support multibyte character sets, and one wouldn't
>>>> want to mess everything when running svn in degraded mode, i.e. with
>>>> ISO-8859-1 locales).
>>>
>>> No. The way (non-Mac) unices deal with this is seriously broken. There
>>> is *no* guarantee the actual input paths are the encoding claimed by
>>> the locale settings.
>>>
>>> There is no way for APR to solve that issue. The only thing it can do
>>> is tell the application which input it should expect. Subversion
>>> offers conversion routines to do the actual "locale"->UTF8 path
>>> conversion since Subversion actually *is* UTF8 "inside", meaning that
>>> it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
>>> input. Not all APR applications may find that desirable (for example:
>>> Apache httpd doesn't initialise locale settings, so, it can't do
>>> locale->utf8 conversions [as the C runtime doesn't know what the
>>> current locale is]; nor will it change that behaviour.)
>>
>> It's worse. SVN doesn't get it right either since it's ignorant of unicode
>> normalization forms [1].
>
> Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> on a sanely configured environment (locale on disk == locale in
> terminal, locale configured in the first place, etc). This is fine,
> since Subversion needs to operate across different configurations and
> even OSes (whereas /bin/ls does not).

Hold up for a second... I'm havin' a little trouble...

> locale on disk

What is this? I know that on my Mac, I use the HFS+ filesystem which  
stores filenames in UTF-16. But that's a character encoding, and it's  
not configurable; it's an integral part of the HFS+ specification.  
Are you saying there's also an associated locale in the filesystem? I  
don't think I've ever been asked to set one, and I don't know how I  
would do so nor how I would figure out what it's set to now...

> == locale in terminal,

This is the locale I know about. "LANG=en_US.UTF-8" and so forth.  
When I do this, Terminal knows how to display filenames from the disk  
correctly because it converts the UTF-16 characters on disk into  
UTF-8 characters for display in Terminal. Similarly, various commands  
like ls and svn know that I want them to output UTF-8 characters to  
the terminal.

> locale configured in the first place

What is this? What is "in the first place"? Is that when I first  
checked out a working copy? when I first made a repository? when I  
first installed Subversion? when I first installed the OS?

It sounded like Vincent was saying that if a working copy is created  
under one terminal locale setting, but then accessed with a different  
terminal locale setting, things don't work right.



Re: UTF-8 support for Unix with APR?

Posted by "B. Blodau" <b_...@hamburg.de>.
Hi,
just for your information:

Calling 'setlocale(LC_ALL, "en_us.UTF-8")' solved my problem on the
Mac. I can now commit and update files with umlauts or even Chinese
characters.
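
Since a hard-coded locale name may not be installed on every machine, one
possible variation is to try several names in order. The sketch below assumes
nothing beyond standard C; select_first_locale is a hypothetical helper and
the candidate names in the note after it are only examples.

```c
#include <locale.h>
#include <stddef.h>

/* Try each locale name in turn. Returns the first name the C library
 * accepts, or NULL if none of them is available. On success the process
 * locale has been changed to the returned name. */
const char *select_first_locale(const char *const *names, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}
```

An application could pass a list such as { "", "en_US.UTF-8" }, so that the
user's own environment settings are tried first and the fixed name is used
only as a fallback.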

Thanks to everybody who helped!
Bert

On 13.02.2008 at 16:14, Erik Huelsmann wrote:

>>>>> The APR libraries handle file paths in the system locale. This means
>>>>> they *may* be encoded in UTF-8, but are not necessarily. When they are
>>>>> interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
>>>>> host environment.
>>>>
>>>> This is broken. APR should switch to UTF-8 locales internally when it
>>>> deals with filenames (like what GNOME apps do). Otherwise this leads
>>>> to consistency problems when the user has both ISO-8859-1 and UTF-8
>>>> terminal sessions (the reason is that some applications and/or some
>>>> machines do not support multibyte character sets, and one wouldn't
>>>> want to mess everything when running svn in degraded mode, i.e. with
>>>> ISO-8859-1 locales).
>>>
>>> No. The way (non-Mac) unices deal with this is seriously broken. There
>>> is *no* guarantee the actual input paths are the encoding claimed by
>>> the locale settings.
>>>
>>> There is no way for APR to solve that issue. The only thing it can do
>>> is tell the application which input it should expect. Subversion
>>> offers conversion routines to do the actual "locale"->UTF8 path
>>> conversion since Subversion actually *is* UTF8 "inside", meaning that
>>> it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
>>> input. Not all APR applications may find that desirable (for example:
>>> Apache httpd doesn't initialise locale settings, so, it can't do
>>> locale->utf8 conversions [as the C runtime doesn't know what the
>>> current locale is]; nor will it change that behaviour.)
>>
>> It's worse. SVN doesn't get it right either since it's ignorant of unicode
>> normalization forms [1].
>
> Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
> on a sanely configured environment (locale on disk == locale in
> terminal, locale configured in the first place, etc). This is fine,
> since Subversion needs to operate across different configurations and
> even OSes (whereas /bin/ls does not).
>
>> OS X always encodes file names in NFD while other
>> unix systems don't standardize this at all, though in practice they tend to
>> use NFC.
>
> Right. This issue is actually not 'worse', but different than the
> other one. (Alas not less unfortunate.) When the Subversion devs (yes,
> I'm one of them) decided to use UTF-8, they didn't realise there are 4
> Unicode normal forms. Fortunately, 2 are irrelevant here, leaving
> 'only' 2 forms. Some (many) filenames will be binary different when
> encoded in one form vs the other (NFC vs NFD) as you describe below.
>
>> The same name in NFD and NFC will be represented by a different
>> sequence and number of unicode code points if it contains e.g. accented
>> characters.
>
> The effect is that Subversion doesn't recognize 2 filenames being the
> same when in fact they are differently encoded. This issue has long
> gone undetected, because many OSes seem to prefer either one or the
> other encoding (Windows and Linux prefer NFC, Solaris I don't know,
> but Mac prefers NFD). When working between Windows and Linux, nobody
> will notice. Neither will Mac users exchanging files.
>
> Many open source projects won't notice either even though they
> exchange between Windows, Linux and Mac, since they restrict
> themselves to ascii filenames. This leaves mixed Windows/Linux and Mac
> setups with accented characters at loss.
>
>> See also subversion issue 2464 [2].
>>
>> [1] http://unicode.org/reports/tr15
>> [2] http://subversion.tigris.org/issues/show_bug.cgi?id=2464
>
> Right. I've written a number of e-mails on the issue, but the other
> developers were too busy working on 1.5 at the time to be open for
> discussion on the issue. I haven't forgotten about it, but this issue
> isn't as easy to solve as it was to solve the "APR doesn't work with
> UTF-8" issue was, because a very large legacy repositories has built
> up in the mean time. We don't want to break those.
>
> We'll be working on it. It's not worse, but unfortunately, the
> resolution to the problem contained a few problems itself and we'll be
> solving those. Hopefully by 1.6.
>
> Bye,
>
>
> Erik.
>



Re: UTF-8 support for Unix with APR?

Posted by Erik Huelsmann <eh...@gmail.com>.
> > > > The APR libraries handle file paths in the system locale. This means
> > > > they *may* be encoded in UTF-8, but are not necessarily. When they are
> > > > interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> > > > host environment.
> > >
> > > This is broken. APR should switch to UTF-8 locales internally when it
> > > deals with filenames (like what GNOME apps do). Otherwise this leads
> > > to consistency problems when the user has both ISO-8859-1 and UTF-8
> > > terminal sessions (the reason is that some applications and/or some
> > > machines do not support multibyte character sets, and one wouldn't
> > > want to mess everything when running svn in degraded mode, i.e. with
> > > ISO-8859-1 locales).
> >
> > No. The way (non-Mac) unices deal with this is seriously broken. There
> > is *no* guarantee the actual input paths are the encoding claimed by
> > the locale settings.
> >
> > There is no way for APR to solve that issue. The only thing it can do
> > is tell the application which input it should expect. Subversion
> > offers conversion routines to do the actual "locale"->UTF8 path
> > conversion since Subversion actually *is* UTF8 "inside", meaning that
> > it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
> > input. Not all APR applications may find that desirable (for example:
> > Apache httpd doesn't initialise locale settings, so, it can't do
> > locale->utf8 conversions [as the C runtime doesn't know what the
> > current locale is]; nor will it change that behaviour.)
>
> It's worse. SVN doesn't get it right either since it's ignorant of unicode
> normalization forms [1].

Well, yes and no :-) Subversion depends (more so than, say, /bin/ls)
on a sanely configured environment (locale on disk == locale in
terminal, locale configured in the first place, etc). This is fine,
since Subversion needs to operate across different configurations and
even OSes (whereas /bin/ls does not).

> OS X always encodes file names in NFD while other
> unix systems don't standardize this at all, though in practice they tend to
> use NFC.

Right. This issue is actually not 'worse', but different than the
other one. (Alas not less unfortunate.) When the Subversion devs (yes,
I'm one of them) decided to use UTF-8, they didn't realise there are 4
Unicode normal forms. Fortunately, 2 are irrelevant here, leaving
'only' 2 forms. Some (many) filenames will be binary different when
encoded in one form vs the other (NFC vs NFD) as you describe below.

> The same name in NFD and NFC will be represented by a different
> sequence and number of unicode code points if it contains e.g. accented
> characters.

The effect is that Subversion doesn't recognize 2 filenames being the
same when in fact they are differently encoded. This issue has long
gone undetected, because many OSes seem to prefer either one or the
other encoding (Windows and Linux prefer NFC, Solaris I don't know,
but Mac prefers NFD). When working between Windows and Linux, nobody
will notice. Neither will Mac users exchanging files.
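
The difference is easy to see at the byte level. A small sketch follows; the
names utf8_code_points, NAME_NFC and NAME_NFD are made up for illustration,
and the byte values are the standard UTF-8 encodings of U+00E9 and of U+0065
followed by U+0301.

```c
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated, well-formed UTF-8 string
 * by counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* "é" precomposed (NFC): U+00E9. */
static const char NAME_NFC[] = "\xC3\xA9";
/* "é" decomposed (NFD): U+0065 followed by U+0301 COMBINING ACUTE ACCENT. */
static const char NAME_NFD[] = "e\xCC\x81";
```

Both strings display as the same character, yet NAME_NFC is one code point
in two bytes and NAME_NFD is two code points in three bytes, so a byte-wise
comparison such as strcmp() treats them as different names.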

Many open source projects won't notice either even though they
exchange between Windows, Linux and Mac, since they restrict
themselves to ascii filenames. This leaves mixed Windows/Linux and Mac
setups with accented characters at a loss.

> See also subversion issue 2464 [2].
>
> [1] http://unicode.org/reports/tr15
> [2] http://subversion.tigris.org/issues/show_bug.cgi?id=2464

Right. I've written a number of e-mails on the issue, but the other
developers were too busy working on 1.5 at the time to be open for
discussion on it. I haven't forgotten about it, but this issue isn't
as easy to solve as the "APR doesn't work with UTF-8" issue was,
because a very large body of legacy repositories has built up in the
meantime. We don't want to break those.

We'll be working on it. It's not worse, but unfortunately, the
resolution to the original problem brought a few problems of its own,
and we'll be solving those. Hopefully by 1.6.

Bye,


Erik.


Re: UTF-8 support for Unix with APR?

Posted by B Smith-Mannschott <bs...@gmail.com>.
On Feb 13, 2008 1:41 PM, Erik Huelsmann <eh...@gmail.com> wrote:

> On 2/13/08, Vincent Lefevre <vi...@vinc17.org> wrote:
> > On 2008-02-12 13:52:41 +0100, Erik Huelsmann wrote:
> > > The APR libraries handle file paths in the system locale. This means
> > > they *may* be encoded in UTF-8, but are not necessarily. When they are
> > > interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> > > host environment.
> >
> > This is broken. APR should switch to UTF-8 locales internally when it
> > deals with filenames (like what GNOME apps do). Otherwise this leads
> > to consistency problems when the user has both ISO-8859-1 and UTF-8
> > terminal sessions (the reason is that some applications and/or some
> > machines do not support multibyte character sets, and one wouldn't
> > want to mess everything when running svn in degraded mode, i.e. with
> > ISO-8859-1 locales).
>
> No. The way (non-Mac) unices deal with this is seriously broken. There
> is *no* guarantee the actual input paths are the encoding claimed by
> the locale settings.
>
> There is no way for APR to solve that issue. The only thing it can do
> is tell the application which input it should expect. Subversion
> offers conversion routines to do the actual "locale"->UTF8 path
> conversion since Subversion actually *is* UTF8 "inside", meaning that
> it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
> input. Not all APR applications may find that desirable (for example:
> Apache httpd doesn't initialise locale settings, so, it can't do
> locale->utf8 conversions [as the C runtime doesn't know what the
> current locale is]; nor will it change that behaviour.)
>


It's worse. SVN doesn't get it right either since it's ignorant of unicode
normalization forms [1]. OS X always encodes file names in NFD while other
unix systems don't standardize this at all, though in practice they tend to
use NFC.  The same name in NFD and NFC will be represented by a different
sequence and number of unicode code points if it contains e.g. accented
characters. See also subversion issue 2464 [2].

[1] http://unicode.org/reports/tr15
[2] http://subversion.tigris.org/issues/show_bug.cgi?id=2464

-- 
// Ben Smith-Mannschott

Re: UTF-8 support for Unix with APR?

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2008-02-13 13:41:04 +0100, Erik Huelsmann wrote:
> No. The way (non-Mac) unices deal with this is seriously broken.

Yes, that's why there are workarounds.

> There is *no* guarantee the actual input paths are the encoding
> claimed by the locale settings.

Agreed. But one of the problems is that svn doesn't remember the
convention that has been chosen. Let's take an example. The user
has done a checkout under UTF-8 locales, and there was a file
called aé in the repository. The user can type "svn st", which
outputs nothing, as expected.

Now, for some reason, the user needs to use an ISO-8859-1 terminal
session. Then he cd's to the working copy and types "svn st", but
he gets:

vin:~tmp/wc> svn st
?      aé
!      aé

So, instead of just having a display problem in some cases, the
user has to face a much more serious problem. One even gets a
cryptic error message with "svn up":

vin:~tmp/wc> svn up
svn: Can't copy '.svn/text-base/aé.svn-base' to '.svn/tmp/aé.tmp.tmp': Success
zsh: exit 1     svn up

It is annoying to have such problems even though the user doesn't
manipulate non-ASCII characters himself ("svn st" and "svn up" are
commands using plain ASCII).
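
A rough illustration of why the names could stop matching (a sketch
under assumptions, not what svn actually calls internally;
latin1_to_utf8 is a made-up helper): if the bytes on disk are UTF-8 but
the locale claims ISO-8859-1, a locale->UTF-8 conversion re-encodes
every byte above 0x7F as two bytes.

```c
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to
   UTF-8. Returns 0 on success, -1 if the output buffer is too small. */
int latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return -1;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;

        if (c < 0x80) {
            /* ASCII passes through unchanged; keep room for the NUL. */
            if (n + 1 >= outsize)
                return -1;
            out[n++] = (char)c;
        } else {
            /* U+0080..U+00FF become a two-byte UTF-8 sequence. */
            if (n + 2 >= outsize)
                return -1;
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return 0;
}
```

Feeding it the UTF-8 bytes of "aé" (61 C3 A9) yields 61 C3 83 C2 A9,
which no longer compares equal to the name recorded in the working
copy. That would be consistent with the same file showing up once as
unversioned ("?") and once as missing ("!").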

Note: Using a wrapper to start svn in UTF-8 locales would avoid these
problems, but this would also add other problems, e.g. error messages
(in a non-English language) would be output with an incorrect encoding
to the terminal. And unfortunately, the user currently has no way to
tell svn to use some encoding for the filenames (in the file system)
and some other encoding for the output.

> There is no way for APR to solve that issue. The only thing it can
> do is tell the application which input it should expect.

Note that in the above example, all input is in plain ASCII. So, the
problem is more than the encoding of the input.

FYI, because of the use of various locales and various OSes (Linux,
which doesn't do any normalization, and Mac OS X, which has chosen
NFD), I have personally chosen not to use non-ASCII characters in my
filenames. But I expect svn to behave in a sensible way when dealing
with non-ASCII characters in filenames created by other people, at
least when I don't use these filenames directly (e.g. with "svn st"
and "svn up" like in my example above).

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)


Re: UTF-8 support for Unix with APR?

Posted by Erik Huelsmann <eh...@gmail.com>.
On 2/13/08, Vincent Lefevre <vi...@vinc17.org> wrote:
> On 2008-02-12 13:52:41 +0100, Erik Huelsmann wrote:
> > The APR libraries handle file paths in the system locale. This means
> > they *may* be encoded in UTF-8, but are not necessarily. When they are
> > interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> > host environment.
>
> This is broken. APR should switch to UTF-8 locales internally when it
> deals with filenames (like what GNOME apps do). Otherwise this leads
> to consistency problems when the user has both ISO-8859-1 and UTF-8
> terminal sessions (the reason is that some applications and/or some
> machines do not support multibyte character sets, and one wouldn't
> want to mess everything when running svn in degraded mode, i.e. with
> ISO-8859-1 locales).

No. The way (non-Mac) unices deal with this is seriously broken. There
is *no* guarantee the actual input paths are the encoding claimed by
the locale settings.

There is no way for APR to solve that issue. The only thing it can do
is tell the application which input it should expect. Subversion
offers conversion routines to do the actual "locale"->UTF8 path
conversion since Subversion actually *is* UTF8 "inside", meaning that
it's ok for Subversion to err when it encounters invalid (ie non-UTF8)
input. Not all APR applications may find that desirable (for example:
Apache httpd doesn't initialise locale settings, so, it can't do
locale->utf8 conversions [as the C runtime doesn't know what the
current locale is]; nor will it change that behaviour.)

HTH,

Erik.


Re: UTF-8 support for Unix with APR?

Posted by Vincent Lefevre <vi...@vinc17.org>.
On 2008-02-12 13:52:41 +0100, Erik Huelsmann wrote:
> The APR libraries handle file paths in the system locale. This means
> they *may* be encoded in UTF-8, but are not necessarily. When they are
> interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in the
> host environment.

This is broken. APR should switch to UTF-8 locales internally when it
deals with filenames (like what GNOME apps do). Otherwise this leads
to consistency problems when the user has both ISO-8859-1 and UTF-8
terminal sessions (the reason is that some applications and/or some
machines do not support multibyte character sets, and one wouldn't
want to mess everything when running svn in degraded mode, i.e. with
ISO-8859-1 locales).

-- 
Vincent Lefèvre <vi...@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)


Re: UTF-8 support for Unix with APR?

Posted by Erik Huelsmann <eh...@gmail.com>.
On 2/12/08, B. Blodau <b_...@hamburg.de> wrote:
> I wrote a similar request for this topic earlier, but now it becomes
> a more general issue.
>
> My questions are:
> - Can subversion support utf8 filenames on Unix systems when using
> the apr libraries?
> - Has anybody used the C-Libraries on a unix system (including MacOS
> X) and successfully used international pathnames?
>
> I'm writing a C++ application which uses the svn libraries and
> therefore inherits the apr libraries too.
> When working with international filenames I'm getting errors that
> characters could not be converted from utf8 to the local encoding.
>
> Since my app is a Unicode safe application I don't want the filenames
> to be converted, because I want to keep the whole Unicode character set.
> Even a successful conversion to the encoding of the current user
> locale, would result in a limited character set.
>
> When debugging this a bit further I came to the following code snippet in
> ".../apr/file_io/unix/filepath.c":
>
> APR_DECLARE(apr_status_t) apr_filepath_encoding(int *style,
> apr_pool_t *p)
> {
>     *style = APR_FILEPATH_ENCODING_LOCALE;
>     return APR_SUCCESS;
> }
>
> This looks as if - at least for Unix - no utf8 support is intended.
> Otherwise this function should return APR_FILEPATH_ENCODING_UTF8.

The APR libraries handle file paths in the system locale. This means
they *may* be encoded in UTF-8, but are not necessarily. Whether they
are interpreted as UTF-8 depends on the LANG or LC_CTYPE settings in
the host environment.

LANG=en_US.UTF-8

will indicate UTF-8 pathnames. OTOH,

LANG=en_US.iso8859-1

will indicate "latin1" pathnames. The application is free to do with
that information whatever it wants. Subversion uses the returned value
to determine whether there's any "locale"->"UTF8" conversion
necessary, since internally it entirely uses UTF8 encoded pathnames.

> Can anybody confirm my concern? I just don't want to search for a
> solution where I don't have any chance.

If you encounter conversion problems, chances are you didn't give
your application any information about the locale settings to be
used: only when you do that will Subversion know what the source
encoding is (and it won't convert if the source is already considered
UTF8). Did you call setlocale() (the C library function)?
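
A minimal sketch of that step, assuming a POSIX system and using only
standard C library calls (the helper names init_locale_codeset and
codeset_is_utf8 are made up for the example): call setlocale() once at
startup so the C runtime picks up LANG/LC_CTYPE, then ask which codeset
the locale claims.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: initialise LC_CTYPE from the environment and
   return the codeset name the locale claims, e.g. "UTF-8" or
   "ISO-8859-1". Returns NULL if the locale named in the environment
   is not available on this system. */
const char *init_locale_codeset(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: non-zero if the codeset name denotes UTF-8.
   Spellings vary between systems, so compare case-insensitively and
   accept the name with or without the hyphen. */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}
```

Without the setlocale() call the program stays in the "C" locale it
starts in, regardless of what LANG says, so a check like
codeset_is_utf8(init_locale_codeset()) is only meaningful after that
call has succeeded.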

bye,


Erik.
