Posted to dev@httpd.apache.org by Dean Gaudet <dg...@arctic.org> on 1997/12/19 09:42:02 UTC

"locale" project

Here's something that needs to be done if someone is interested.  Apache
abuses the heck out of the locale specific functions.  Its usage is not
compliant with POSIX/ANSI/whateveryouwant.  It doesn't behave well when a
module wants to use a different locale, or when the admin wants to set a
different locale.  Specific examples: 

- (PR#1305, PR#1450) We use isalpha(), isalnum(), etc. functions with type
"char" arguments.  Strictly speaking we can only do that if we test
isascii() first.  But an alternative which is preferable, because it's
less expensive (performance), is to use unsigned chars (almost) 
everywhere. 

The ANSI C standard allows an implementation to choose whether an
unqualified char is signed or unsigned.  Most compilers default to signed,
but have an option for unsigned (or vice versa).  "gcc -funsigned-char" 
for example makes "char" an unsigned char.  My suggestion is that we
"typedef unsigned char uchar;" and "typedef signed char schar;" and use
those everywhere in place of char.  It's quite possible that we'll never
need schar (the only cases we'd need schar are those where we're using
chars as a 1-byte signed integer, which in apache should be non-existent).

I suspect there will be compiler warning issues because C library
functions use naked "char".  The workaround will likely be to use "gcc
-funsigned-char -Wall", or whatever the equiv is on whatever other ANSI
compiler we use.  Note that this is just a warning issue -- not
necessarily a correctness issue.  (Although the library could be whacked.) 

- (PR#76, #679) We use locale-specific functions (isalpha(), isalnum(),
strftime(), and on and on) without setting the locale.  There is a
conflict here between Apache wanting to use some functions in a specific
way (i.e. assuming the "C" locale) and some modules wanting to let the
user do things in their locale of choice (i.e. mod_php).

A big worry here is that setlocale() is an expensive function.  On
Solaris, for example, it involves reading a file off disk.  So we can't
just switch locales at will.  It's unfortunate, but it's not possible for
a POSIX program to exist in two locales at once.

- (PR#754) struct tm does not include a time zone on all systems, we
assume it does.  It's entirely possible that we'll need our own strftime() 
replacement.

- I'm certain we have assumptions that isalpha(c) is the same as c == 'a'
|| c == 'b' || ... || c == 'z' (and the upper case letters).  When in
truth, isalpha(c) can also be true for various 8-bit characters in most
non-"C" locales.  For example, in "ISO-8859-1", isalpha(192) is TRUE. 
This has to be fixed. 
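
A minimal sketch of the first and last points above (not Apache's actual
code): converting to unsigned char keeps the ctype calls well-defined for
8-bit input, and a hand-written range test gives an answer that does not
change with the active locale.  The names c_isalpha and locale_isalpha are
hypothetical, and the range test assumes an ASCII execution character set.

```c
#include <ctype.h>

/* Hypothetical helper: true only for 'A'-'Z' and 'a'-'z'.
 * Assumes ASCII, where each range is contiguous.  The result does not
 * depend on setlocale(). */
static int c_isalpha(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Hypothetical helper: asks the current locale, but converts the plain
 * char to unsigned char first so a negative value (other than EOF) is
 * never handed to isalpha(). */
static int locale_isalpha(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```

Code that parses protocol syntax would use the first form; code that
formats text for the local user would use the second.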

I'd say that if someone does take this on, they'll have to possibly
maintain this as a patch through 1.x versions into 2.x versions.  The
changes are likely too large to make 1.3.  I would like them to be part of
1.x, but until we see how far reaching they are it's hard to say if we can
fit it into the "schedule".  It's unlikely they can make 1.3.0. 

A good approach would be to research the problem first... and report back
how you plan to deal with it all.  We'll all argue it (we always do, it's
better to argue before you write code and possibly waste your time).  Then
write the code. 

Dean



Re: "locale" project

Posted by Dean Gaudet <dg...@arctic.org>.
Oh it's easy to do that efficiently -- it's just a matter of reading the
data once when you create the locale object, we'd do that at config time. 
It's nice to know they're dealing with this... it's unfortunate it's in
C++. 
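
A rough C sketch of that "read the data once" idea, assuming a single-byte
character set: at config time the server switches to the admin's locale
once, records what isalpha() says for all 256 byte values, then switches
back, so request handling never calls setlocale().  The names user_alpha
and snapshot_alpha_table are hypothetical, and the snapshot itself is not
thread-safe because it changes the global locale for a moment.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical cache: user_alpha[c] is 1 if byte c is alphabetic in the
 * locale that was snapshotted, else 0. */
static unsigned char user_alpha[256];

/* Fill user_alpha from locale_name, then restore the previous LC_CTYPE.
 * Returns 0 on success, -1 if the locale could not be selected. */
static int snapshot_alpha_table(const char *locale_name)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);
    int c;

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);          /* later setlocale() calls may overwrite cur */

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;               /* locale unchanged on failure */

    for (c = 0; c < 256; ++c)
        user_alpha[c] = (unsigned char)(isalpha(c) != 0);

    setlocale(LC_CTYPE, saved);  /* back to the locale the core expects */
    return 0;
}
```

After the snapshot, a lookup is a plain array index such as
user_alpha[(unsigned char)ch], with no locale state involved.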

Dean

On Fri, 19 Dec 1997, Ingo Luetkebohle wrote:

> Just piping in here...
> 
> Андрей Чернов wrote:
> > > > > A big worry here is that setlocale() is an expensive function.  On
> > > > > Solaris, for example, it involves reading a file off disk.  So we can't
> > > > > just switch locales at will.  It's unfortunate, but it's not possible for
> > > > > a POSIX program to exist in two locales at once.
> 
> Well, the new C++ draft standard allows you to use multiple locales at
> once, intermixed, in the same process. I don't know how it's implemented,
> but it might be worth a look -- I doubt they'd put something inherently
> slow into the standard library.
> 
> ---Ingo Luetkebohle
> dev/consulting Gesellschaft fuer Netzwerkentwicklung und -beratung mbH
> url: http://www.devconsult.de/ - fon: 0521-1365800 - fax: 0521-1365803
> 


Re: "locale" project

Posted by Ingo Luetkebohle <in...@blank.pages.de>.
Just piping in here...

Андрей Чернов wrote:
> > > > A big worry here is that setlocale() is an expensive function.  On
> > > > Solaris, for example, it involves reading a file off disk.  So we can't
> > > > just switch locales at will.  It's unfortunate, but it's not possible for
> > > > a POSIX program to exist in two locales at once.

Well, the new C++ draft standard allows you to use multiple locales at
once, intermixed, in the same process. I don't know how it's implemented,
but it might be worth a look -- I doubt they'd put something inherently
slow into the standard library.

---Ingo Luetkebohle
dev/consulting Gesellschaft fuer Netzwerkentwicklung und -beratung mbH
url: http://www.devconsult.de/ - fon: 0521-1365800 - fax: 0521-1365803

Re: "locale" project

Posted by ra...@bellglobal.com.
> An alternative is to say to these users "tough, C is all we support".  Can
> some of the mod_php folk explain maybe why users need the locale
> functionality?  I think I know why... but a specific example would be good
> :)

The most common use is PHP's StrToUpper()/Lower() functions.  The shift
information for certain languages is not handled properly in the C locale
and thus make these functions useless unless it is possible to change the
locale.

-Rasmus



Re: "locale" project

Posted by Ben Laurie <be...@algroup.co.uk>.
Martin Kraemer wrote:
> 
> On Fri, Dec 19, 1997 at 12:17:28PM -0800, Dean Gaudet wrote:
> > You can find similar lame examples using strncpy,
> > and strncat.
> 
> Yes, but we could certainly eliminate assignments like
> 
>     strncpy(server_root, HTTPD_ROOT, sizeof(server_root) - 1);
> 
> today. What use is there in filling in the HTTPD_ROOT, and then filling
> up all the rest (up to HUGE_STRING_LEN-sizeof(HTTPD_ROOT) bytes) with
> zero characters? The functionality of strncpy() is totally broken: the
> only use for it is in the directory entry fill-in code of the
> 14-char-length Sys5 file system (there, the file names must be padded to
> length 14 with binary zeros, unless their length is 14 already. But
> probably, that's also the only place where it _isn't_ used...). For
> everyday use, strncpy() is the wrong candidate. What's needed is a
> "bounded strcpy()", something like:
>     strncat(strcpy(dest,""), srce, length);

Good grief. I'd never noticed that strncpy did zero fill!

Cheers,

Ben.

-- 
Ben Laurie            |Phone: +44 (181) 735 0686|Apache Group member
Freelance Consultant  |Fax:   +44 (181) 735 0689|http://www.apache.org
and Technical Director|Email: ben@algroup.co.uk |Apache-SSL author
A.L. Digital Ltd,     |http://www.algroup.co.uk/Apache-SSL
London, England.      |"Apache: TDG" http://www.ora.com/catalog/apache

Re: "locale" project

Posted by Dean Gaudet <dg...@arctic.org>.

On Mon, 22 Dec 1997, Martin Kraemer wrote:

> Yes, but we could certainly eliminate assignments like
> 
>     strncpy(server_root, HTTPD_ROOT, sizeof(server_root) - 1);
> 
> today. What use is there in filling in the HTTPD_ROOT, and then filling
> up all the rest (up to HUGE_STRING_LEN-sizeof(HTTPD_ROOT) bytes) with
> zero characters? The functionality of strncpy() is totally broken: the
> only use for it is in the directory entry fill-in code of the
> 14-char-length Sys5 file system (there, the file names must be padded to
> length 14 with binary zeros, unless their length is 14 already. But
> probably, that's also the only place where it _isn't_ used...). For
> everyday use, strncpy() is the wrong candidate. What's needed is a
> "bounded strcpy()", something like:
>     strncat(strcpy(dest,""), srce, length);

Oh wow.  I never knew this.  This is such a load of crap... yes let's make
a useful strncpy.  One which properly \0 terminates in all cases as well. 
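
Something along these lines, as a sketch only: copy at most size-1 bytes,
always terminate, never zero-fill the tail, and return a pointer to the
terminator so calls can be chained.  The name bounded_cpy is hypothetical.

```c
#include <stddef.h>

/* Hypothetical bounded copy.  Writes at most dst_size bytes into dst,
 * including the terminating '\0', and leaves the rest of dst untouched.
 * Returns a pointer to the '\0' it wrote (or dst if dst_size is 0). */
static char *bounded_cpy(char *dst, const char *src, size_t dst_size)
{
    char *end;

    if (dst_size == 0)
        return dst;              /* no room even for the terminator */

    end = dst + dst_size - 1;    /* last byte is reserved for '\0' */
    while (dst < end && (*dst = *src) != '\0') {
        ++dst;
        ++src;
    }
    *dst = '\0';                 /* terminates on truncation too */
    return dst;
}
```

A caller can detect truncation by checking whether the returned pointer
equals dst + dst_size - 1 while the source still has bytes left.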

Dean


Re: "locale" project

Posted by Martin Kraemer <Ma...@mch.sni.de>.
On Fri, Dec 19, 1997 at 12:17:28PM -0800, Dean Gaudet wrote:
> You can find similar lame examples using strncpy,
> and strncat.

Yes, but we could certainly eliminate assignments like

    strncpy(server_root, HTTPD_ROOT, sizeof(server_root) - 1);

today. What use is there in filling in the HTTPD_ROOT, and then filling
up all the rest (up to HUGE_STRING_LEN-sizeof(HTTPD_ROOT) bytes) with
zero characters? The functionality of strncpy() is totally broken: the
only use for it is in the directory entry fill-in code of the
14-char-length Sys5 file system (there, the file names must be padded to
length 14 with binary zeros, unless their length is 14 already. But
probably, that's also the only place where it _isn't_ used...). For
everyday use, strncpy() is the wrong candidate. What's needed is a
"bounded strcpy()", something like:
    strncat(strcpy(dest,""), srce, length);

    Martin
-- 
| S I E M E N S |  <Ma...@mch.sni.de>  |      Siemens Nixdorf
| ------------- |   Voice: +49-89-636-46021     |  Informationssysteme AG
| N I X D O R F |   FAX:   +49-89-636-44994     |   81730 Munich, Germany
~~~~~~~~~~~~~~~~My opinions only, of course; pgp key available on request

Re: "locale" project

Posted by Dean Gaudet <dg...@arctic.org>.
On Fri, 19 Dec 1997, Андрей Чернов wrote:

> On Fri, 19 Dec 1997, Dean Gaudet wrote:
> 
> > move this way as well... and as we replace string and alloc functions we
> > venture further and further away from libc.
> 
> Just two examples from FreeBSD libc:
> string functions are mostly in assembler and malloc functions effectively
> talk to kernel VM system to give more optimization and returning memory
> back to VM. All these features will be lost with re-implementing :-(
> Ok, ok, I keep silence...

Ah, but writing strcpy and strcat in assembler does not necessarily lead to
improved performance.  Consider this (assume there are no buffer overflows):

    void the_libc_way(char *d, char *s1, char *s2)
    {
	strcpy(d, s1);
	strcat(d, s2);
    }

    /* here's a function that I found in Lattice C for the Amiga libc,
     * it's strcpy, but returns a pointer to the NUL-terminator of d,
     * which is about 2000 times more useful than returning d.
     */
    char *stpcpy(char *d, const char *s)
    {
	while((*d = *s)) {
	    ++d;
	    ++s;
	}
	return d;
    }

    void a_faster_way(char *d, char *s1, char *s2)
    {
	strcpy(stpcpy(d, s1), s2);
    }

The first does 2*|s1|+|s2| work, the second does |s1|+|s2| work.

i.e. optimizing libc isn't going to make the lame code that's written to
use libc go faster.  You can find similar lame examples using strncpy,
and strncat.  Another case that causes extreme lameness is the \0
terminated string itself -- which forces you to make duplicates of
things just so that you can get a token which is \0-terminated to pass
to another routine.  If strings were struct { char *p; size_t len }
this copying wouldn't have to happen.
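
A small sketch of that pointer-plus-length idea, with hypothetical names
(struct strview, next_token): a token is just a window into the original
buffer, so splitting a string needs no allocation and no copy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: not '\0'-terminated, length is explicit. */
struct strview {
    const char *p;
    size_t len;
};

/* Split the next token off the front of *rest at the first delim.
 * The token points into the original buffer; *rest is advanced past
 * the delimiter, or becomes empty if no delimiter is found. */
static struct strview next_token(struct strview *rest, char delim)
{
    struct strview tok;
    const char *hit = memchr(rest->p, delim, rest->len);

    tok.p = rest->p;
    if (hit == NULL) {
        tok.len = rest->len;
        rest->p += rest->len;
        rest->len = 0;
    } else {
        tok.len = (size_t)(hit - rest->p);
        rest->len -= tok.len + 1;    /* skip the token and the delimiter */
        rest->p = hit + 1;
    }
    return tok;
}
```

The cost shows up at the boundary: anything that must be handed to a libc
routine expecting a '\0'-terminated string still has to be copied.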

BTW we still use malloc(), we just wrap it in a bunch of our own stuff.  We
don't ever free() though.

Dean


Re: "locale" project

Posted by Андрей Чернов <ac...@nagual.pp.ru>.
On Fri, 19 Dec 1997, Dean Gaudet wrote:

> Yeah... you're right.  But we could provide locale C specific functions --
> and maybe we can even do it in a way that a server can be built with the
> option of using only locale C, or using C and one other. 

A program already runs with "C" locale by default unless it calls
setlocale() directly, so you don't need any functions to implement "C" 
locale, it is already here. 

For "other" locale you need to rely on libc setlocale() and others or
implement Apache's own setlocale() family with data set for each locale
users want to use. Also you need to re-implement affected functions such as
strcoll, strxfrm, strftime, whole ctype, glob and regexp and maybe printf.

> move this way as well... and as we replace string and alloc functions we
> venture further and further away from libc.

Just two examples from FreeBSD libc:
string functions are mostly in assembler and malloc functions effectively
talk to kernel VM system to give more optimization and returning memory
back to VM. All these features will be lost with re-implementing :-(
Ok, ok, I keep silence...

-- 
Andrey A. Chernov
<ac...@nietzsche.net>
http://www.nagual.pp.ru/~ache/


Re: "locale" project

Posted by Dean Gaudet <dg...@arctic.org>.

On Fri, 19 Dec 1997, Андрей Чернов wrote:

> In the first place I don't understand why Apache itself (not modules) needs
> to be under locale != "C", do you have some examples? I.e. if you need just
> localized strftime() output just fork once with specific locale at the
> Apache startup and pass all locale-specific requests to forked process. 
> Shared memory or mmap gives almost no overhead in this situation.  In this
> model we have just one process per locale used. I am against runtime
> locale switching inside Apache core because it causes too much overhead.

I'm against runtime switching of locales as well.  In fact all we need is
the C locale and whatever other locale the local user wants -- I listed
two PRs where the user wants their own locale for things like mod_php and
mod_include when formatting dates for sending to the client.  Since we need
only C and "one other" locale I think we could get away with implementing
specifically the set of locale functions we use, and using the system's
own setlocale() for whatever "true" locale the user wants to use.  What a
mess.
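
As a sketch of what one "locale C specific" function could look like: a
date formatter that carries its own English day and month names, so its
output stays the same even after setlocale() has been called for the
user's locale.  The name c_http_date and the fixed "GMT" suffix are
assumptions made for illustration, and snprintf() from C99 is assumed to
be available.

```c
#include <stdio.h>
#include <time.h>

static const char *const c_days[7] = {
    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
};
static const char *const c_months[12] = {
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
};

/* Hypothetical helper: format *gmt as e.g. "Fri, 19 Dec 1997 09:42:02 GMT"
 * without consulting the current locale.  The caller supplies a struct tm
 * that is already in GMT.  Returns the snprintf() result, or -1 if the
 * weekday or month field is out of range. */
static int c_http_date(char *buf, size_t size, const struct tm *gmt)
{
    if (gmt->tm_wday < 0 || gmt->tm_wday > 6 ||
        gmt->tm_mon < 0 || gmt->tm_mon > 11)
        return -1;

    return snprintf(buf, size, "%s, %02d %s %d %02d:%02d:%02d GMT",
                    c_days[gmt->tm_wday], gmt->tm_mday,
                    c_months[gmt->tm_mon], 1900 + gmt->tm_year,
                    gmt->tm_hour, gmt->tm_min, gmt->tm_sec);
}
```

The system strftime() would then be left for output that is meant to
follow the user's locale.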

An alternative is to say to these users "tough, C is all we support".  Can
some of the mod_php folk explain maybe why users need the locale
functionality?  I think I know why... but a specific example would be good
:)

> > The C library is poorly designed, POSIX isn't helping it at all.  We
> > pretty much have to replace all the string and allocation functions
> > because we need better resource management and more functionality.  It's a
> > similar step to start replacing the locale functions because we need
> > better locale management. 
> 
> It is almost impossible to replace locale functions: although locale calls
> are mostly standardized per POSIX, even call arguments (locale names) are
> not standardized and locale data itself is _very_ different from system to
> system...

Yeah... you're right.  But we could provide locale C specific functions --
and maybe we can even do it in a way that a server can be built with the
option of using only locale C, or using C and one other. 

> I understand your portability intention but it leads to an Apache Operating
> System(TM) as a result, running on a dedicated machine with nothing else. It
> seems there is no good solution to this problem.  I hope it does not happen
> in the near future....

It sucks... but we're not the first to go down this road.  qmail uses
essentially nothing from man section 3, it has a library of its own which
is more suited to writing secure applications.  I'd like to see Apache
move this way as well... and as we replace string and alloc functions we
venture further and further away from libc.

> > One more thing to add:  locale is a global setting, in a threaded port we
> > can't switch locales at all. 
> 
> Maybe it would be good to ask someone from the POSIX committee about their
> locale-related plans.

Yeah, but I always get confused by all the different subcommittees. 

Hey C++ folks, is this solved in any magic C++ class library?  Ick, I
don't want to bloat us further by going to C++ ;) 

Dean


Re: "locale" project

Posted by Андрей Чернов <ac...@nagual.pp.ru>.
On Fri, 19 Dec 1997, Dean Gaudet wrote:

> > > A big worry here is that setlocale() is an expensive function.  On
> > > Solaris, for example, it involves reading a file off disk.  So we can't
> > > just switch locales at will.  It's unfortunate, but it's not possible for
> > > a POSIX program to exist in two locales at once.
> > 
> > One of the ways can be fork as many times as locales used and do all
> > locale-specific work in subprocesses.
> 
> This is apache we're talking about, performance is a huge issue.  I don't
> think we want to be forking more processes just to be doing things like
> strftime().

In the first place I don't understand why Apache itself (not modules) needs
to be under locale != "C", do you have some examples? I.e. if you need just
localized strftime() output just fork once with specific locale at the
Apache startup and pass all locale-specific requests to forked process. 
Shared memory or mmap gives almost no overhead in this situation.  In this
model we have just one process per locale used. I am against runtime
locale switching inside Apache core because it causes too much overhead.

> The C library is poorly designed, POSIX isn't helping it at all.  We
> pretty much have to replace all the string and allocation functions
> because we need better resource management and more functionality.  It's a
> similar step to start replacing the locale functions because we need
> better locale management. 

It is almost impossible to replace locale functions: although locale calls
are mostly standardized per POSIX, even call arguments (locale names) are
not standardized and locale data itself is _very_ different from system to
system...

I understand your portability intention but it leads to an Apache Operating
System(TM) as a result, running on a dedicated machine with nothing else. It
seems there is no good solution to this problem.  I hope it does not happen
in the near future....

> One more thing to add:  locale is a global setting, in a threaded port we
> can't switch locales at all. 

Maybe it would be good to ask someone from the POSIX committee about their
locale-related plans.

-- 
Andrey A. Chernov
<ac...@nietzsche.net>
http://www.nagual.pp.ru/~ache/


Re: "locale" project

Posted by Dean Gaudet <dg...@arctic.org>.
On Fri, 19 Dec 1997, Андрей Чернов wrote:

> On Fri, 19 Dec 1997, Dean Gaudet wrote:
> 
> > I suspect there will be compiler warning issues because C library
> > functions use naked "char".  The workaround will likely be to use "gcc
> > -funsigned-char -Wall", or whatever the equiv is on whatever other ANSI
> > compiler we use.  Note that this is just a warning issue -- not
> > necessarily a correctness issue.  (Although the library could be whacked.) 
> 
> That is the reason why "typedef ... uchar" is not really needed: it would
> have to be backed by -funsigned-char in any case as a workaround for the
> massive warnings.  -funsigned-char alone seems enough.

Apache can be compiled by an arbitrary ANSI C compiler, not just gcc, so
this isn't a complete solution. 

> > A big worry here is that setlocale() is an expensive function.  On
> > Solaris, for example, it involves reading a file off disk.  So we can't
> > just switch locales at will.  It's unfortunate, but it's not possible for
> > a POSIX program to exist in two locales at once.
> 
> One of the ways can be fork as many times as locales used and do all
> locale-specific work in subprocesses.

This is apache we're talking about, performance is a huge issue.  I don't
think we want to be forking more processes just to be doing things like
strftime().

> > - (PR#754) struct tm does not include a time zone on all systems, we
> > assume it does.  It's entirely possible that we'll need our own strftime() 
> > replacement.
> 
> Oh, no! Apache already replaces too many standard functions. Is it time
> now for Apache-libc? Let's do it only for systems which really need them,
> not for all systems as was done for snprintf, for example.

You and I have argued this one before -- you want a tiny memory image on
FreeBSD where you happen to have most of the functionality current apache
wants.  I (and others) want portable code.  Apache replaces functions that
are broken or missing on enough systems that we can't ensure complete
portability.  We value portability -- we don't just want to run on freebsd
and linux.  I think in this case the C library is sorely lacking -- it
should be possible on a per-call basis to supply a locale. 

The systems I deal with can easily handle the extra overlap of 100k or
200k library code.  Many webservers are dedicated machines.  Apache isn't
about to start entering the embedded systems market, our code is bloated
already. 

The C library is poorly designed, POSIX isn't helping it at all.  We
pretty much have to replace all the string and allocation functions
because we need better resource management and more functionality.  It's a
similar step to start replacing the locale functions because we need
better locale management. 

> > - I'm certain we have assumptions that isalpha(c) is the same as c == 'a'
> > || c == 'b' || ... || c == 'z' (and the upper case letters).  When in
> > truth, isalpha(c) can also be true for various 8-bit characters in most
> > non-"C" locales.  For example, in "ISO-8859-1", isalpha(192) is TRUE. 
> > This has to be fixed. 
> 
> Unless we switch locale from "C", this assumption is true, but if we
> switch somewhere, it is false. I don't think that Apache as daemon must
> run itself under locale different than "C", but forked modules can switch
> locale instead.

We don't fork modules. 

One more thing to add:  locale is a global setting, in a threaded port we
can't switch locales at all. 

Dean


Re: "locale" project

Posted by Андрей Чернов <ac...@nagual.pp.ru>.
On Fri, 19 Dec 1997, Dean Gaudet wrote:

> I suspect there will be compiler warning issues because C library
> functions use naked "char".  The workaround will likely be to use "gcc
> -funsigned-char -Wall", or whatever the equiv is on whatever other ANSI
> compiler we use.  Note that this is just a warning issue -- not
> necessarily a correctness issue.  (Although the library could be whacked.) 

That is the reason why "typedef ... uchar" is not really needed: it would
have to be backed by -funsigned-char in any case as a workaround for the
massive warnings.  -funsigned-char alone seems enough.

> A big worry here is that setlocale() is an expensive function.  On
> Solaris, for example, it involves reading a file off disk.  So we can't
> just switch locales at will.  It's unfortunate, but it's not possible for
> a POSIX program to exist in two locales at once.

One of the ways can be fork as many times as locales used and do all
locale-specific work in subprocesses.

> - (PR#754) struct tm does not include a time zone on all systems, we
> assume it does.  It's entirely possible that we'll need our own strftime() 
> replacement.

Oh, no! Apache already replaces too many standard functions. Is it time
now for Apache-libc? Let's do it only for systems which really need them,
not for all systems as was done for snprintf, for example.

> - I'm certain we have assumptions that isalpha(c) is the same as c == 'a'
> || c == 'b' || ... || c == 'z' (and the upper case letters).  When in
> truth, isalpha(c) can also be true for various 8-bit characters in most
> non-"C" locales.  For example, in "ISO-8859-1", isalpha(192) is TRUE. 
> This has to be fixed. 

Unless we switch locale from "C", this assumption is true, but if we
switch somewhere, it is false. I don't think that Apache as daemon must
run itself under locale different than "C", but forked modules can switch
locale instead.

-- 
Andrey A. Chernov
<ac...@nietzsche.net>
http://www.nagual.pp.ru/~ache/