Posted to dev@httpd.apache.org by Greg Stein <gs...@lyra.org> on 2001/02/25 00:27:45 UTC

unicode file APIs (was: Re: canonical stuff)

On Sat, Feb 24, 2001 at 11:31:49AM -0600, William A. Rowe, Jr. wrote:
> From: "Greg Stein" <gs...@lyra.org>
> Sent: Saturday, February 24, 2001 3:44 AM
>...
> > In a similar vein, when you added all that Unicode stuff, it just kind of
> > dropped into the code. No big deal as it was all Win32 specific (i.e. it
> > didn't affect my playground), but it was an awfully big change. Especially
> > in the semantics. We still haven't refactored the API into two sets of
> > functions (one for Unicode chars, one for 8-bit native).
> 
> I'm absolutely positively near certain we won't.  Please let me explain.
>
> ... lot of stuff about why Unicode filenames are Goodness ...

I don't disagree with wanting Unicode filenames. I completely disagree with
APIs that change their semantics based on the platform they are compiled on.

If I have an application that I desire to be portable, then I'm going to use
APR to do it. In my app, I call apr_file_open(some_8bit_name). That should
work on all platforms. With the current single API, it will break on NT when
compiled with the Unicode stuff.

None of the APIs change their semantics. They exist or they don't, but they
don't change.

The answer is to have apr_file_open_u() for opening with Unicode filenames,
not changing the encoding of the existing apr_file_open. You completely
break all possibility of writing portable apps when you do that. And APR is
*about* writing portable apps.
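
To sketch what I mean (the flag and permission arguments are only illustrative
here, and apr_file_open_u() is hypothetical; it does not exist):

    #include "apr_file_io.h"
    #include "apr_pools.h"

    /* Sketch only: the existing call keeps its native 8-bit meaning on
     * every platform; a separate _u entry point is the one that takes
     * UTF-8.  Portable callers keep working unchanged. */
    apr_status_t open_logs(apr_pool_t *p)
    {
        apr_file_t *f;
        apr_status_t rv;

        /* 8-bit native name; same semantics everywhere */
        rv = apr_file_open(&f, "logs/error_log",
                           APR_READ, APR_OS_DEFAULT, p);
        if (rv != APR_SUCCESS)
            return rv;
        apr_file_close(f);

        /* hypothetical Unicode variant, for callers that want it:
         * rv = apr_file_open_u(&f, utf8_name, APR_READ, APR_OS_DEFAULT, p);
         */
        return APR_SUCCESS;
    }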

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: unicode file APIs (was: Re: canonical stuff)

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
[Moved strictly to dev@apr.apache.org - since this seems to _not_ be a discussion
of apache, but primarily of an API for other APR users.]

From: "Greg Stein" <gs...@lyra.org>
Sent: Saturday, February 24, 2001 5:27 PM


> On Sat, Feb 24, 2001 at 11:31:49AM -0600, William A. Rowe, Jr. wrote:
> > From: "Greg Stein" <gs...@lyra.org>
> > Sent: Saturday, February 24, 2001 3:44 AM
> >...
> > > In a similar vein, when you added all that Unicode stuff, it just kind of
> > > dropped into the code. No big deal as it was all Win32 specific (i.e. it
> > > didn't affect my playground), but it was an awfully big change. Especially
> > > in the semantics. We still haven't refactored the API into two sets of
> > > functions (one for Unicode chars, one for 8-bit native).
> > 
> > I'm absolutely positively near certain we won't.  Please let me explain.
> >
> > ... lot of stuff about why Unicode filenames are Goodness ...
> 
> I don't disagree with wanting Unicode filenames. I completely disagree with
> APIs that change their semantics based on the platform they are compiled on.

And I'm _arguing_ that the semantics _do_ change, regardless of APR_HAS_UNICODE_FS.

Simply put - Win32 has a restricted set of characters.  Not only is it a restricted
set of characters, but alpha chars map from upper to lower case in very unpredictable
ways.  By unpredictable, I mean that the clib tolower()/toupper() _never_ matches the
mappings that the Win32 filesystem performs.  That's a very nasty side effect that
isn't really very tolerable.  Of course, we also eliminate a number of symbols on
Win32 that simply aren't supported, but are perfectly legal on Unix.

OTOH, spaces are not a problem, as they seem to be for Unix.

> If I have an application that I desire to be portable, then I'm going to use
> APR to do it. In my app, I call apr_file_open(some_8bit_name). That should
> work on all platforms. With the current single API, it will break on NT when
> compiled with the Unicode stuff.

What is portable here?  There is nothing portable about high-bit characters.
Other than opaque data, you can't make many assumptions about them without an
API that we haven't defined for APR.  Not that we shouldn't define one, or that it
shouldn't map the characters appropriately for _whatever_ code page the user desires.
But as things stand, it simply doesn't hold together.  Local code pages are not
effective for file naming, for
most applications, unless more information is known about the system.  We don't have
a way to provide that information.

> None of the APIs change their semantics. They exist or they don't, but they
> don't change.
> 
> The answer is to have apr_file_open_u() for opening with Unicode filenames,
> not changing the encoding of the existing apr_file_open. You completely
> break all possibility of writing portable apps when you do that. And APR is
> *about* writing portable apps.

What does apr_file_open_u() do on Unix?  I would expect nothing.  Unless you have
a utf-8 build of unix (and those do exist), this is pretty meaningless.  But what _if_
the user is building apr under a utf-8 powered unix?  Is the filename Bite%x81Me.txt
accepted?  I can't answer the question.  What happens if it is accepted and created?
What does ls Bite* do?  That character alone is a continuation character with no lead
byte.  Does ls show anything worthwhile?
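
Just to make the continuation-byte point concrete, the check a utf-8 aware
layer would have to make is roughly this (a sketch, not anything in APR):

    #include <stdio.h>

    /* A byte in the range 0x80-0xBF is a continuation byte and can never
     * start a character, so "Bite\x81Me.txt" is not well-formed utf-8. */
    static int utf8_valid(const unsigned char *s)
    {
        while (*s) {
            int follow;
            if (*s < 0x80)                follow = 0;   /* ascii             */
            else if ((*s & 0xE0) == 0xC0) follow = 1;   /* 2-byte lead       */
            else if ((*s & 0xF0) == 0xE0) follow = 2;   /* 3-byte lead       */
            else if ((*s & 0xF8) == 0xF0) follow = 3;   /* 4-byte lead       */
            else                          return 0;     /* bare continuation */
            while (follow--) {
                if ((*++s & 0xC0) != 0x80)
                    return 0;                           /* truncated sequence */
            }
            s++;
        }
        return 1;
    }

    int main(void)
    {
        printf("%d\n", utf8_valid((const unsigned char *)"Bite\x81Me.txt")); /* 0 */
        printf("%d\n", utf8_valid((const unsigned char *)"BiteMe.txt"));     /* 1 */
        return 0;
    }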

I'm saying stop even looking at Win32 for 10 minutes, and examine the bigger issues
that allow this to become a cross-platform API.  Then we can begin the process of
determining an _appropriate_ api to cover these issues.

There is nothing that says _any_ filesystem accepts high bit characters, except that
some do.  How can we relate this to the user and the coder?  I don't have an answer;
I simply believe that apr_functions_u() is anything but the common denominator.

Bill


Re: unicode file APIs (was: Re: canonical stuff)

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
From: "dean gaudet" <dg...@arctic.org>
Sent: Sunday, February 25, 2001 7:42 PM


> i'm a bit of an I18N novice, but doesn't it all just magically work if you
> use UTF-8 encoding everywhere?
>
> UTF-8 deliberately avoids using \0 and / in the encodings.  plain ascii
> works unmodified.  unix filesystems generally support UTF-8 directly
> (because of the \0 and / avoidance).
>
> this allows you to have a single API which understands unicode on all
> platforms -- you don't need to have _u versions which take unicode
> strings.

You are understanding exactly what I proposed with APR_HAS_UNICODE_FS.
My only small change is a way to get config directives in with wchar
support.  Since Win32 has no utf-8 editor, I'm working out the patch
to recognize the lead word of a unicode stream and switch to unicode-to-utf-8
conversion.  Even notepad on Win32 supports unicode files, so
this becomes a no-brainer for administrators.
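
The sniff itself is about this simple (a sketch of the idea only, not the
actual patch):

    #include <stdio.h>

    typedef enum { CONF_NARROW, CONF_UTF16_LE, CONF_UTF16_BE } conf_enc;

    /* Peek at the lead word of the stream: a UTF-16 byte order mark means
     * the rest of the file gets run through unicode-to-utf-8 conversion;
     * anything else is read as narrow characters. */
    static conf_enc sniff_encoding(FILE *fp)
    {
        unsigned char bom[2];
        size_t n = fread(bom, 1, 2, fp);

        if (n == 2 && bom[0] == 0xFF && bom[1] == 0xFE)
            return CONF_UTF16_LE;
        if (n == 2 && bom[0] == 0xFE && bom[1] == 0xFF)
            return CONF_UTF16_BE;

        fseek(fp, 0L, SEEK_SET);   /* no BOM: rewind, read narrow chars */
        return CONF_NARROW;
    }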

> give this page a perusal:  http://www.cl.cam.ac.uk/~mgk25/unicode.html

I especially liked a comment from http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

    External file system drivers such as VFAT and WinNT have to convert file
    name character encodings.  UTF-8 has to be added to the list of already
    available conversion options, and the mount command has to tell the kernel
    driver that user processes shall see UTF-8 file names.  Since VFAT and
    WinNT use already Unicode anyway, UTF-8 has the advantage of guaranteeing
    a lossless conversion here.

My key concept is _lossless_.  All SomeWin32FunctionA() variants are lossy, and
their encoding doesn't correspond to MS's own clib [we can comment on their lack
of brain cells here ... but we won't.]  All SomeWin32FunctionW() variants are
not only lossless, but faster.  Obviously we replace their conversion cycles
from local code page to unicode with our own utf-8 to unicode functions, but that
shouldn't (if I succeeded) add any net CPU cycles.
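
The shape of a W() call path is roughly the following; the utf-8 to unicode
step is our own routine in APR, so MultiByteToWideChar appears here only to
keep the sketch self-contained:

    #include <windows.h>

    /* Sketch: widen the utf-8 name, then call the lossless W() API
     * directly instead of the lossy A() one. */
    HANDLE open_utf8_path(const char *utf8_path)
    {
        WCHAR wpath[MAX_PATH];

        if (!MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1,
                                 wpath, MAX_PATH))
            return INVALID_HANDLE_VALUE;

        return CreateFileW(wpath, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    }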

Of course they don't correspond to the clib functions [e.g. - consider strlen()]
but we are damned if we do... damned if we don't.  mod_autoindex obviously needs
to see APR_IS_UNICODE_FS and adjust the width accordingly.  We will get there, but
we aren't there yet.
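
For the width question, the difference is bytes versus characters; something
like the following (counting code points, a first approximation of display
width) is what an adjusted mod_autoindex would need instead of strlen():

    #include <stddef.h>

    /* Count utf-8 characters rather than bytes: continuation bytes
     * (10xxxxxx) do not start a new character. */
    static size_t utf8_charlen(const char *s)
    {
        size_t chars = 0;
        for (; *s; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                chars++;
        }
        return chars;
    }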

If we support the native narrow characters, we need an effective API to do so
[should we use the current ansi code page or the current oem code page?]  We didn't
have a respectable design, and this change made all those other issues moot.



Re: unicode file APIs (was: Re: canonical stuff)

Posted by Sander van Zoest <sa...@covalent.net>.
On Sun, 25 Feb 2001, dean gaudet wrote:

> > The answer is to have apr_file_open_u() for opening with Unicode filenames,
> > not changing the encoding of the existing apr_file_open. You completely
> > break all possibility of writing portable apps when you do that. And APR is
> > *about* writing portable apps.
> i'm a bit of an I18N novice, but doesn't it all just magically work if you
> use UTF-8 encoding everywhere?
> 
> UTF-8 deliberately avoids using \0 and / in the encodings.  plain ascii
> works unmodified.  unix filesystems generally support UTF-8 directly
> (because of the \0 and / avoidance).
> 
> this allows you to have a single API which understands unicode on all
> platforms -- you don't need to have _u versions which take unicode
> strings.
> 
> give this page a perusal:  http://www.cl.cam.ac.uk/~mgk25/unicode.html

i18n can be kind of a pain when you need to convert data that you do not
know the charset for, or data that you do not control.

Going to a fully ISO-10646 (UTF-8) system would kill all the issues, but the
problem is making that migration and converting everything.  There isn't much
code out there that does all the mappings.

I do think, as wrowe points out, that this probably should be handled inside
APR, so that apache can handle as much as possible in ISO-10646,
especially if everything it interacts with supports it.

Now the problem comes in when you deal with non-10646 data outside of
the ASCII and latin1 charsets on a 10646-based server.  You need
to convert somehow, and if we convert to UTF-8 via iconv then I
do not see an issue.
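
For data in a known charset such as latin1, that conversion is not much code.
A rough sketch with iconv(3), not any agreed-on APR interface:

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "r\xe9sum\xe9";      /* latin1: "resume" with e-acute */
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1)
            return 1;

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
            return 1;
        *outp = '\0';

        printf("%s\n", out);             /* bytes: 72 C3 A9 73 75 6D C3 A9 */
        iconv_close(cd);
        return 0;
    }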
  
--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/


Re: unicode file APIs (was: Re: canonical stuff)

Posted by dean gaudet <dg...@arctic.org>.
i'm a bit of an I18N novice, but doesn't it all just magically work if you
use UTF-8 encoding everywhere?

UTF-8 deliberately avoids using \0 and / in the encodings.  plain ascii
works unmodified.  unix filesystems generally support UTF-8 directly
(because of the \0 and / avoidance).
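
a quick encode shows why: every byte of a multibyte sequence has the high bit
set, so 0x00 and '/' (0x2F) can only ever come from the plain ascii characters
themselves (a sketch, not library code):

    #include <stdio.h>

    /* encode one code point (up to U+10FFFF) to utf-8 */
    static int utf8_encode(unsigned long cp, unsigned char buf[4])
    {
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n = utf8_encode(0x20AC, buf);   /* U+20AC, the euro sign */
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);           /* prints: E2 82 AC */
        printf("\n");
        return 0;
    }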

this allows you to have a single API which understands unicode on all
platforms -- you don't need to have _u versions which take unicode
strings.

give this page a perusal:  http://www.cl.cam.ac.uk/~mgk25/unicode.html

-dean

On Sat, 24 Feb 2001, Greg Stein wrote:

> On Sat, Feb 24, 2001 at 11:31:49AM -0600, William A. Rowe, Jr. wrote:
> > From: "Greg Stein" <gs...@lyra.org>
> > Sent: Saturday, February 24, 2001 3:44 AM
> >...
> > > In a similar vein, when you added all that Unicode stuff, it just kind of
> > > dropped into the code. No big deal as it was all Win32 specific (i.e. it
> > > didn't affect my playground), but it was an awfully big change. Especially
> > > in the semantics. We still haven't refactored the API into two sets of
> > > functions (one for Unicode chars, one for 8-bit native).
> >
> > I'm absolutely positively near certain we won't.  Please let me explain.
> >
> > ... lot of stuff about why Unicode filenames are Goodness ...
>
> I don't disagree with wanting Unicode filenames. I completely disagree with
> APIs that change their semantics based on the platform they are compiled on.
>
> If I have an application that I desire to be portable, then I'm going to use
> APR to do it. In my app, I call apr_file_open(some_8bit_name). That should
> work on all platforms. With the current single API, it will break on NT when
> compiled with the Unicode stuff.
>
> None of the APIs change their semantics. They exist or they don't, but they
> don't change.
>
> The answer is to have apr_file_open_u() for opening with Unicode filenames,
> not changing the encoding of the existing apr_file_open. You completely
> break all possibility of writing portable apps when you do that. And APR is
> *about* writing portable apps.
>
> Cheers,
> -g
>
> --
> Greg Stein, http://www.lyra.org/
>


