Posted to dev@httpd.apache.org by William A Rowe Jr <wr...@rowe-clan.net> on 2015/11/25 18:42:29 UTC

apr_token_* conclusions (was: Better casecmpstr[n]?)

On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> What is the current status? Is this on hold?
>

It is looking for a good name.  I'm happy with apr_token_strcasecmp
to best indicate its use-case and provenance.  Does that work for
everyone?

It is looking for clearer docs.  Spent 20 hours just reviewing locale in C
(partly to phrase this discussion more accurately). About to pen those
based on better definitions of terms.

So here are my conclusions as they apply to apr, and to httpd after
all of this locale review...

Background
-----------------

The C spec defines the locale as "C" (a.k.a. "POSIX") at startup.  Until
somewhere in the code the locale of the process is switched with a call
to setlocale(LC_ALL, "") or similar, the application remains in this
deterministic (Anglicized) state.  The empty string causes LANG
and the LC_* variables to be evaluated.  Most modern *nix utilities do
this right off the bat in their source.  The compiler should *not* do so
without instruction, so while gcc does the right thing in this respect,
other compilers may not be behaving appropriately.
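
To make the startup state concrete, here is a minimal sketch of how
code could check whether it is still running in the "C" locale.  The
helper name in_c_locale() is made up for illustration; only setlocale()
itself is standard.  Passing NULL queries the setting without changing
anything.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper (not an APR or httpd function): returns 1 if
 * character classification is still governed by the "C" (or "POSIX")
 * locale, 0 otherwise. */
static int in_c_locale(void)
{
    /* A NULL second argument only queries the current locale name. */
    const char *cur = setlocale(LC_CTYPE, NULL);

    return cur != NULL
        && (strcmp(cur, "C") == 0 || strcmp(cur, "POSIX") == 0);
}
```

A program that never calls setlocale() should see this return 1 no
matter what LANG or LC_* say in the environment.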

You might react "how do we handle UTF-8 [or other code page] then?"
The answer is that in the "C" locale, high-bit characters (in ASCII, and
even unusual EBCDIC codes) are effectively opaque.  They *can* be
UTF-8, they might belong to an SBCS (single-byte charset), and they
might be entirely meaningless.  They won't be case folded (but will
keep their unique identity).  Some consumer may recognize them,
others will not.  The C lib functions will treat each octet of an MBCS
as a distinct character, which is a reason that our old-school autoindex
(pre-fancy tables) misaligns the columns when the filename contains
any multibyte characters.  We had simply byte-counted chars.
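
As a rough illustration of the byte-count problem, the sketch below
counts UTF-8 code points instead of octets.  The function name is
invented here, and it assumes the input is valid UTF-8; it is not what
autoindex does today.

```c
#include <stddef.h>

/* Hypothetical helper: count UTF-8 code points rather than bytes.
 * Continuation bytes have the bit pattern 10xxxxxx and are skipped.
 * Assumes the input is well-formed UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For a name such as "été" encoded in UTF-8, strlen() reports 5 while this
reports 3.  Even a code point count is only an approximation of display
width (wide and combining characters exist), so this narrows the
misalignment rather than eliminating it.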

If our code splits these multibyte sequences, things can go wrong
somewhere down the way.  In particular, treating these multibyte
sequences as distinct characters is fine in UTF-8 where every part
of the multibyte sequence is high-bit-set (and therefore opaque),
but is not fine in ISO2022-JP, where low-bit-set characters may
change their meaning (there are bugs to the effect that we search
a string for pathname separator characters, and these can occur
in multibyte sequences which are valid file names, and not path
separator characters).  There isn't much we can do about this.
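
A small sketch of why the byte-wise separator search is safe for UTF-8
but not for ISO-2022-JP.  The function below is a stand-in for any
naive scan for '/'; the name is made up for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a naive path scan: counts every octet
 * equal to '/' (0x2F), with no knowledge of the string's encoding. */
static size_t count_slash_bytes(const char *s)
{
    size_t n = 0;

    for (; *s; ++s) {
        if (*s == '/')
            ++n;
    }
    return n;
}
```

Fed the UTF-8 bytes E3 82 AF (one katakana character), it finds no
separator, because every octet of the sequence has the high bit set.
Fed the ISO-2022-JP bytes 1B 24 42 25 2F 1B 28 42 (escape into
JIS X 0208, one two-byte katakana character, escape back to ASCII), it
reports one separator, because the second byte of that character is
0x2F even though the name contains no '/'.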

When using httpd on a *nix filesystem containing UTF-8 chars, we
accept UTF-8 filenames both in client provided fields and within the
httpd.conf configuration.  On a filesystem containing SBCS filenames,
we similarly accept these without translation.  It's up to the admin
to decide their schema; without a third party module, the UTF-8
name isn't recognized in an SBCS directory, and vice versa.

On Windows, all system resources, including filenames, are stored
in Unicode by the OS.  Within APR, we simply treat all system
resource strings as UTF-8, and therefore the conf file in httpd needs
to spell out the UTF-8 name of the resource. All that said, other than
these resource names, even on Windows most string processing
follows the same opaque logic as *nix.

The Mac OS is somewhat similar to Windows, in that all of the
resource names are actually UTF-8 encoded, AFAICT. But unlike
Windows, these opaque strings "just work"; we don't have to do any
Unicode transliteration to get there.

Observations
-------------------

Nowhere in apr or httpd do *we* call setlocale() to change things.  So
the current use of tolower(), toupper(), strcasecmp() etc should not be
subject to dangerous transliterations by *our* doing.

There is nothing that suggests that the APR consumer *cannot*
call setlocale()!  So we need to revisit APR 2.0 and ensure that
our functions perform as-expected even when operating under
some different locale.  This suggests that apr_token_str[n]casecmp,
apr_token_tolower|toupper, and much of the rest of the code that
exists only to evaluate ASCII alpha characters should *not* pay any
attention to the currently defined locale.

In httpd, where things go sideways is if someone is calling setlocale(),
for example in an in-process PHP, Perl or Lua script, because this
changes the core operation of httpd.  If the script switches the locale
to Turkish, for example, our forced-lowercase content-type conversion
will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the
specs intended.
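
For contrast, a sketch of a fold that cannot be affected by
setlocale(), since it never consults the C library's tables.  The name
token_tolower_inplace() is hypothetical, and the arithmetic assumes an
ASCII-based charset (on EBCDIC the A-Z range is not contiguous, so a
lookup table would be needed instead).

```c
/* Hypothetical helper, not an existing APR function: lower-case only
 * the ASCII letters A-Z in place, leaving every other octet alone.
 * Assumes an ASCII-based execution character set. */
static void token_tolower_inplace(char *s)
{
    for (; *s; ++s) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)(*s - 'A' + 'a');
    }
}
```

With this, "IMAGE/GIF" folds to "image/gif" under any locale, Turkish
included.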

I imagine that this has rarely shown up because the few scripts that
might toggle this were correctly written to toggle back the previous
locale upon completion.  But if they are running inline and have
the locale toggled during the filter handoff, the filters themselves
are likely facing some unexpected behavior.
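
The "toggle back" pattern those scripts presumably follow looks
roughly like the sketch below.  locale_save() and locale_restore() are
names invented for this example.  The copy matters because the string
returned by setlocale() may be overwritten by a later call.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of the current locale name,
 * or NULL on failure.  The caller hands it to locale_restore(). */
static char *locale_save(void)
{
    const char *cur = setlocale(LC_ALL, NULL);
    char *copy;
    size_t len;

    if (cur == NULL)
        return NULL;
    len = strlen(cur) + 1;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, cur, len);
    return copy;
}

/* Hypothetical helper: switch back to a saved locale and free the
 * copy.  Returns 0 on success, -1 on failure. */
static int locale_restore(char *saved)
{
    int ok = (saved != NULL && setlocale(LC_ALL, saved) != NULL);

    free(saved);
    return ok ? 0 : -1;
}
```

Note that setlocale() acts on the whole process, so between the save
and the restore every other thread sees the changed locale as well.
That window is exactly the filter handoff concern described above.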

APR conclusions
-------------------------

Adding unambiguous token handling functions would be good for
the few case-insensitive string comparison, string folding, and
search functions.  It allows the spec-consumer to trust their string
processing.  I'm going to suggest apr_token_* as the API prefix.
The API will preserve "POSIX behavior" as defined by A-Z <> a-z
character equivalence and make no compensation for locales
or for MBCS strings.
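
To give the proposal some shape, here is a sketch of what the
length-limited comparison might look like.  The names and signatures
are placeholders, not the final API, and the range check assumes an
ASCII-based charset.

```c
#include <stddef.h>

/* Hypothetical fold: A-Z map to a-z, everything else is unchanged. */
static int token_fold(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical sketch of a locale-independent strncasecmp(): compares
 * at most n octets, treating only A-Z and a-z as equivalent.  Returns
 * <0, 0 or >0 like strncmp(). */
static int token_strncasecmp(const char *a, const char *b, size_t n)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (n-- > 0) {
        int ca = token_fold(*p++);
        int cb = token_fold(*q++);

        if (ca != cb)
            return ca - cb;
        if (ca == 0)
            break;          /* both strings ended together */
    }
    return 0;
}
```

High-bit octets compare by value only, so they keep their identity but
are never folded, which is the "opaque" treatment described in the
background section.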

httpd conclusions
-------------------------

There are so many edge cases that we simply need to preserve
the code as-is in httpd 2.4 and warn off anyone toggling the locale
that httpd is operating under.  Making the 'offer' to run under many
non-POSIX locales is opening up a can of security vulnerabilities.
It would be irresponsible to half-fix this.

We should perform a thorough review of httpd 2.x trunk so as to
make a statement upon release that the code has been adapted
to correctly operate under most non-POSIX locales.  Warn off
users from third party modules that have not yet made a similar
statement.  Kill all the redundant httpd 2.x == APR functions that
do not belong in trunk (but may have been appropriate in 2.earlier
while waiting for APR to be released).

To the extent that platforms have very poor implementations
of strcmp(), this isn't justification to overload httpd with more band-aids.
Fix the underlying implementation.  So I'm -0.5 on backporting this
change into httpd 2.4 until we see a comprehensive justification, or
until we comprehensively have fixed trunk and are prepared to make
the same "setlocale()-safe" assertion about 2.4.future.

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 01.12.2015 05:31, William A Rowe Jr wrote:
>
> That describes the 'token' use case, right?  While MMX operands let
> the clib devs play with 16-byte/dword/word units, we are principally
> looking at very short strings.  As soon as you do a 16 byte compare
> w/delimiting the null byte, your optimization is lost.
>
> I think we are of one mind on this, sniping aside.  I started with an
> svn cp today from asf subversion, and chose to focus on only the
> svn_cstring_ (excluding svn_string and svn_stringbuf ops), but there
> is room if we give the nod and should treat them as 3 separate
> groupings.  First commit inbound in the morning with lots of room for
> optimization.
>

Ack. I think svn_string, and certainly svn_stringbuf, are out of scope
for APR.

-- Brane

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Yann Ylavic <yl...@gmail.com>.
I agree that the discussion about the implementation belongs in apr-dev@.
Cross posting my original httpd-dev@ message below yours (the wording
may be httpd -use case- related, sorry about that)...

On Tue, Dec 1, 2015 at 5:31 AM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> That describes the 'token' use case, right?  While MMX operands let the clib
> devs play with 16-byte/dword/word units, we are principally looking at very
> short strings.  As soon as you do a 16 byte compare w/delimiting the null
> byte, your optimization is lost.
>
> I think we are of one mind on this, sniping aside.  I started with an svn cp
> today from asf subversion, and chose to focus on only the svn_cstring_
> (excluding svn_string and svn_stringbuf ops), but there is room if we give
> the nod and should treat them as 3 separate groupings.  First commit inbound
> in the morning with lots of room for optimization.


On Nov 30, 2015 11:20, "Yann Ylavic" <yl...@gmail.com> wrote:

Sorry for the late reply, I was AFK lately...

Regarding the name, I'm fine with ap[r]_cstr[n]casecmp(),
ap[r]_casecmpcstr[n]() or ap[r]_cstr_*() (if we need a set of
functions in this area)..

I think we all agree that the new function(s) would help protocol
"validation" be agnostic wrt the locale, though httpd (as any *nix
program) runs in the "C" locale by default (hence str[n]casecmp()
behaves as expected), and this can't be changed unless some
(third-party) module plays with setlocale(), as Bill said.

So the new function(s) would address two concerns:
1. doing the right thing at the protocol level if/when modules need
custom locales,
2. providing an efficient "C"-string caseless comparison function on all
platforms (see tests results below).

For 1. I agree we should not hurry and take the time to review the
kind of changes I proposed in [1].

For 2. I think we can start using the new function(s) whenever we are
dealing with "C"-strings and this is a fast path (eg. Jean-Frederic's
report about ap_proxy_port_of_scheme(), which should be addressed both
in httpd and APR IMHO).


Regarding performance, attached are the tests (and results) I ran on
different systems (linuxes+glibc+gcc only!, i.e.
Debian6+glibc-2.11+gcc-4.4, Debian8+glibc-2.19+gcc-4.9 and
CentOS7+glibc-2.17+gcc-4.8) for the different implementations that
were discussed so far (including standard strncasecmp,
svn_cstring_casecmp, and Mikhail's mi_strcasecmp).

<tl;dr>

a. Our implementation(s) are faster than str[n]casecmp() for string
lengths < 4 or 8 (depending on sizeof(long), i.e. 32-bit vs 64-bit
system), which matters not only for such short strings but also when
the compared strings differ in these first bytes (our implementation
fails faster here too),

b. The latest str[n]casecmp() (and/or gcc) are far faster (x3) than any
of our proposals in the "C" (or "UTF-8") locale for longer strings; too
bad there is no strcasecmp[_loc]() taking the locale as argument (à la
stdc++)...
*However*, whenever mappings are in place, eg. the famous
mt_MT.ISO-8859, str[n]casecmp() takes the same time as our
implementation (comparing the same number of caseless-equal
characters),

c. Our best implementation, which performs well in all cases (i.e.
no "pathological" behaviour with some cases) is Jim's "ap_casecmpstr"
(the current one).
Actually the ones performing a bit better are those called
"ap_casecmpstr_1" and "ap_casecmpstr_2" in the test, the former being
the same as Jim's but with "++ps1; ++ps2;" done at the end of the
loop, and the latter being my proposed version using an index instead
of char pointers (no gain compared to "ap_casecmpstr_1", not worth the
change...).
So I'd be for using Jim's with the simple "++ps1; ++ps2;" change.

</tl;dr>
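
For readers without the attachment, the shape of the loop discussed in
(c) is roughly as follows.  This is a sketch reconstructed from the
description, not the attached test code: the table is built at run time
here to keep the example short, and the names fold_map,
fold_map_init() and table_casecmp() are made up.

```c
/* 256-entry fold table; must be initialized once before use. */
static unsigned char fold_map[256];

/* Hypothetical initializer: identity mapping except A-Z -> a-z.
 * Assumes an ASCII-based charset. */
static void fold_map_init(void)
{
    int i;

    for (i = 0; i < 256; ++i) {
        fold_map[i] = (unsigned char)
            ((i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i);
    }
}

/* Table-driven caseless compare with the pointer increments done at
 * the end of the loop body, after the comparison. */
static int table_casecmp(const char *s1, const char *s2)
{
    const unsigned char *ps1 = (const unsigned char *)s1;
    const unsigned char *ps2 = (const unsigned char *)s2;

    for (;;) {
        const int c1 = *ps1;
        const int c2 = *ps2;
        const int cmp = fold_map[c1] - fold_map[c2];

        if (cmp || !c1)
            return cmp;
        ++ps1;
        ++ps2;
    }
}
```

Comparing a string of 0xA9 octets against "iii..." with this yields
0xA9 - 0x69 = 64 regardless of locale.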


The attached test results are the ones run on CentOS7 (because this is
the system of a real/performant machine I can access, and running the
tests on my Debian laptop makes it hot enough to be unfair :)
Since I'm not very used to CentOS, I could not get the
"mt_MT.ISO-8859" locale to work/be applied, either because I'm doing
things wrong, or somehow the locale has been updated to avoid this
mapping (though I was able to make it work with latest debians, where
strcasecmp() performs differently depending on the locale...).

So for completeness, I'm pasting the results on a debian jessie for
locales "mt_MT.ISO-8859" and "C" here, since it matters there:

$ LC_ALL=mt_MT.iso88593 ./ap_casecmpstr-O2 'a' 100000000
'CyCyCyCyCyCyCyCyCyCoOoOoOoOoOaAaAaAaAaAa'
'cYcYcYcYcYcYcYcYcYcOoOoOoOoOoAaAaAaAaAaa' 0
./ap_casecmpstr-O2 'a' 100000000
"CyCyCyCyCyCyCyCyCyCoOoOoOoOoOaAaAaAaAaAa"
"cYcYcYcYcYcYcYcYcYcOoOoOoOoOoAaAaAaAaAaa" 0: locale "mt_MT.iso88593"
- ap_casecmpstr       : time=06.160937456, res=0
- ap_casecmpstr_1     : time=06.256894742, res=0
- ap_casecmpstr_2     : time=06.136213804, res=0
- ap_casecmpstr_4     : time=06.787756289, res=0
- ap_casecmpstr_3     : time=06.110559311, res=0
- ap_casecmpstr_7     : time=06.844624092, res=0
- ap_casecmpstr_5     : time=06.820174763, res=0
- ap_casecmpstr_6     : time=10.488436936, res=0
- svn_cstring_casecmp : time=07.329213881, res=0
- mi_strcasecmp       : time=10.165367784, res=0
- strcasecmp_ext      : time=06.274211596, res=0
- strcasecmp          : time=06.126361486, res=0
- strcmp              : time=00.590613344, res=-32 != str[n]casecmp()'s result!

$ LC_ALL=mt_MT.iso88593 ./ap_casecmpstr-O2 'a' 100000000
$'\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9'
'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii' 0
./ap_casecmpstr-O2 'a' 100000000 "<...40 unprintable chars here...>"
"iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii" 0: locale "mt_MT.iso88593"
- ap_casecmpstr       : time=00.479203198, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_1     : time=00.526867671, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_2     : time=00.525341010, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_4     : time=00.529071848, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_3     : time=00.528699442, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_7     : time=00.507921257, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_5     : time=00.524753726, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_6     : time=00.524081999, res=64 != str[n]casecmp()'s result!
- svn_cstring_casecmp : time=00.509402346, res=64 != str[n]casecmp()'s result!
- mi_strcasecmp       : time=00.532081427, res=96 != str[n]casecmp()'s result!
- strcasecmp_ext      : time=06.309950716, res=0
- strcasecmp          : time=06.148251655, res=0
- strcmp              : time=00.525644420, res=64 != str[n]casecmp()'s result!

Whereas with "C" locale, I've got:

$ LC_ALL=C ./ap_casecmpstr-O2 'a' 100000000
'CyCyCyCyCyCyCyCyCyCoOoOoOoOoOaAaAaAaAaAa'
'cYcYcYcYcYcYcYcYcYcOoOoOoOoOoAaAaAaAaAaa' 0
./ap_casecmpstr-O2 'a' 100000000
"CyCyCyCyCyCyCyCyCyCoOoOoOoOoOaAaAaAaAaAa"
"cYcYcYcYcYcYcYcYcYcOoOoOoOoOoAaAaAaAaAaa" 0: locale "C"
- ap_casecmpstr       : time=06.191792200, res=0
- ap_casecmpstr_1     : time=06.147878566, res=0
- ap_casecmpstr_2     : time=06.333936899, res=0
- ap_casecmpstr_4     : time=06.870865790, res=0
- ap_casecmpstr_3     : time=06.227310131, res=0
- ap_casecmpstr_7     : time=06.856304522, res=0
- ap_casecmpstr_5     : time=06.788184432, res=0
- ap_casecmpstr_6     : time=10.437171106, res=0
- svn_cstring_casecmp : time=07.325735333, res=0
- mi_strcasecmp       : time=10.351743646, res=0
- strcasecmp_ext      : time=01.649636857, res=0
- strcasecmp          : time=01.443062626, res=0
- strcmp              : time=00.502131680, res=-32 != str[n]casecmp()'s result!

$ LC_ALL=C ./ap_casecmpstr-O2 'a' 100000000
$'\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9'
'iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii' 0
./ap_casecmpstr-O2 'a' 100000000 "<...40 unprintable chars here...>"
"iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii" 0: locale "C"
- ap_casecmpstr       : time=00.489929313, res=64
- ap_casecmpstr_1     : time=00.537979168, res=64
- ap_casecmpstr_2     : time=00.528165151, res=64
- ap_casecmpstr_4     : time=00.541039123, res=64
- ap_casecmpstr_3     : time=00.545316485, res=64
- ap_casecmpstr_7     : time=00.522669572, res=64
- ap_casecmpstr_5     : time=00.541433772, res=64
- ap_casecmpstr_6     : time=00.529486005, res=64
- svn_cstring_casecmp : time=00.518018373, res=64
- mi_strcasecmp       : time=00.551022179, res=96 != str[n]casecmp()'s result!
- strcasecmp_ext      : time=01.070935658, res=64
- strcasecmp          : time=00.950721549, res=64
- strcmp              : time=00.529242598, res=64

Regards,
Yann.

[1] http://permalink.gmane.org/gmane.comp.apache.devel/57670

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
That describes the 'token' use case, right?  While MMX operands let the
clib devs play with 16-byte/dword/word units, we are principally looking at
very short strings.  As soon as you do a 16 byte compare w/delimiting the
null byte, your optimization is lost.

I think we are of one mind on this, sniping aside.  I started with an svn
cp today from asf subversion, and chose to focus on only the svn_cstring_
(excluding svn_string and svn_stringbuf ops), but there is room if we give
the nod and should treat them as 3 separate groupings.  First commit
inbound in the morning with lots of room for optimization.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
I've hijacked Yann's thoughts and replied on a dev@apr thread.  There is
merit in httpd's deliberations but the issue is sufficiently larger than
'just us ourselves'.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 9:44 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> LANG="ku_TR.iso88599";
>    64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>       ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
>       v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>       ?  ........*.................      ''''''''*'''''''''''''''''
>   192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
>       ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
>       v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
>       ? ....................... .....*. ''''''''''''''''''''''' '''''*'
>

The translation here is pretty simple.  We display the ^ toupper() and the
v tolower() value of every character.  For the summary line '?', in normal
or -v verbose mode, ' ' suggests no translations at all, '.' means this ch
has a lower case translation, "'" means the ch has an upper case translation,
but I strip most of these lines out while searching for the exceptional
cases...

'*' is the surprising case, the high bit character translation falls into
the ancient 0-127 code plane, or a ch 0-127 falls into the high bit plane,
or anything within the traditional 0-127 code plane translates into an
unexpected position.
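
In code terms, the summary character for one octet could be derived
along these lines.  This is a simplified sketch of the rules just
described, not the attached source; classify_byte() is a made-up name
and the expected mappings assume an ASCII-based charset.

```c
#include <ctype.h>

/* Hypothetical helper: classify octet ch (0..255) under the current
 * locale.  Returns '*' for a surprising mapping, '.' if ch has a
 * lower case translation, '\'' if it has an upper case translation,
 * and ' ' if neither toupper() nor tolower() changes it. */
static char classify_byte(int ch)
{
    int up = toupper(ch);
    int lo = tolower(ch);
    /* What plain ASCII rules would give for this octet. */
    int exp_up = (ch >= 'a' && ch <= 'z') ? ch - ('a' - 'A') : ch;
    int exp_lo = (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;

    if (ch < 128) {
        /* Any deviation within 0-127 is unexpected. */
        if (up != exp_up || lo != exp_lo)
            return '*';
    } else if (up < 128 || lo < 128) {
        /* A high-bit octet translating down into 0-127. */
        return '*';
    }
    if (lo != ch)
        return '.';
    if (up != ch)
        return '\'';
    return ' ';
}
```

Looping this over 0..255 after setlocale(LC_ALL, "") gives one summary
line per 64 characters; under the "C" locale no octet should be
flagged '*'.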

LANG="mt_MT.iso88593";
  128 =                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°ħ²³´µĥ·¸ışğĵ½ ż
      ^                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°Ħ²³´µĤ·¸IŞĞĴ½ Ż
      v                                  ħ˘£¤ ĥ§¨işğĵ­ ż°ħ²³´µĥ·¸ışğĵ½ ż
      ?                                  .    .  *...  . '    '  *'''  '

The last example above seems to indicate an isprint() validation error
or utf-8 mis-assignment in iconv, somewhere in the last 16 characters
of this code table, apparently between Ĵ­ and Ż :)

Fwd: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
Some further analysis...

---------- Forwarded message ----------
From: William A Rowe Jr <wr...@rowe-clan.net>
Date: Wed, Nov 25, 2015 at 9:44 PM
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
To: httpd <de...@httpd.apache.org>


On Wed, Nov 25, 2015 at 6:45 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Nov 25, 2015 4:19 PM, "Mikhail T." <mi...@aldan.algebra.com> wrote:
> >
> > Thus, I contend, using C-library will not cause invalid results, and the
> > only reason to have Apache's own implementation is performance, but not
> > correctness.
>
> Well almost but wrong...
>
> The pure char-based ß processing produced no case change in my reviews of
> tolower/toupper in de_DE codeset. If you were to examine string comparison
> the collation order changes substantially.
>
> And more to the point, if tolower()/toupper() could handle not only mbcs
> but multicharacter transliteration, your results would have varied.  1:1
> character translations have their intrinsic limits.
>
> That said, I'm working up a comprehensive audit and other codeset/language
> combinations absolutely do.  Code and results forthcoming shortly.
>
As promised, here's a quick review based on the sbcs and utf8 code pages in
the very limited single-byte scope on my machine.

I did not touch the following mbcs because they require 'shift-state' to
toggle into and out of specific characters and that implies a lot of
calculated fuzzing that I didn't have time for this week.  (Since mod_ftp
explicit tls is still broken, I had no time for any of this, either ;-)  I
also didn't get to evaluating the wide chars yet that fall into the
traditional posix/c ascii range, which I still mean to do, and haven't yet
repeated this exercise on win32 or os/x, only on a somewhat multinational
configuration of fedora 22.

The source code is pretty rudimentary.  I used iconv to shove all of the
resulting text evaluation into utf-8 for the console/file output, it really
plays no part in the locality equation.  It can be adapted for testing
similar on an EBCDIC box with a bit of clever coding I never got to.

Untested: ja_JP.eucjp ja_JP.ujis japanese.euc ko_KR.euckr korean.euc
zh_CN.gb18030 zh_CN.gb2312 zh_CN.gbk zh_HK.big5hkscs zh_SG.gb2312 zh_SG.gbk
zh_TW.big5 zh_TW.euctw

Tested and exceptional results noted (source code attached);

LANG="aa_DJ.iso88591";
        no surprises
LANG="af_ZA.iso88591";
        no surprises
LANG="an_ES.iso885915";
        no surprises
LANG="ar_AE.iso88596";
        no surprises
LANG="ar_BH.iso88596";
        no surprises
LANG="ar_DZ.iso88596";
        no surprises
LANG="ar_EG.iso88596";
        no surprises
LANG="ar_IQ.iso88596";
        no surprises
LANG="ar_JO.iso88596";
        no surprises
LANG="ar_KW.iso88596";
        no surprises
LANG="ar_LB.iso88596";
        no surprises
LANG="ar_LY.iso88596";
        no surprises
LANG="ar_MA.iso88596";
        no surprises
LANG="ar_OM.iso88596";
        no surprises
LANG="ar_QA.iso88596";
        no surprises
LANG="ar_SA.iso88596";
        no surprises
LANG="ar_SD.iso88596";
        no surprises
LANG="ar_SY.iso88596";
        no surprises
LANG="ar_TN.iso88596";
        no surprises
LANG="ar_YE.iso88596";
        no surprises
LANG="ast_ES.iso885915";
        no surprises
LANG="be_BY.cp1251";
        no surprises
LANG="bg_BG.cp1251";
        no surprises
LANG="br_FR.iso88591";
        no surprises
LANG="br_FR.iso885915@euro";
        no surprises
LANG="bs_BA.iso88592";
        no surprises
LANG="ca_AD.iso885915";
        no surprises
LANG="ca_ES.iso88591";
        no surprises
LANG="ca_ES.iso885915@euro";
        no surprises
LANG="ca_FR.iso885915";
        no surprises
LANG="ca_IT.iso885915";
        no surprises
LANG="cs_CZ.iso88592";
        no surprises
LANG="cy_GB.iso885914";
        no surprises
LANG="da_DK.iso88591";
        no surprises
LANG="da_DK.iso885915";
        no surprises
LANG="de_AT.iso88591";
        no surprises
LANG="de_AT.iso885915@euro";
        no surprises
LANG="de_BE.iso88591";
        no surprises
LANG="de_BE.iso885915@euro";
        no surprises
LANG="de_CH.iso88591";
        no surprises
LANG="de_DE.iso88591";
        no surprises
LANG="de_DE.iso885915@euro";
        no surprises
LANG="de_LU.iso88591";
        no surprises
LANG="de_LU.iso885915@euro";
        no surprises
LANG="el_CY.iso88597";
        no surprises
LANG="el_GR.iso88597";
        no surprises
LANG="en_AU.iso88591";
        no surprises
LANG="en_BW.iso88591";
        no surprises
LANG="en_CA.iso88591";
        no surprises
LANG="en_DK.iso88591";
        no surprises
LANG="en_GB.iso88591";
        no surprises
LANG="en_GB.iso885915";
        no surprises
LANG="en_HK.iso88591";
        no surprises
LANG="en_IE.iso88591";
        no surprises
LANG="en_IE.iso885915@euro";
        no surprises
LANG="en_NZ.iso88591";
        no surprises
LANG="en_PH.iso88591";
        no surprises
LANG="en_SG.iso88591";
        no surprises
LANG="en_US.iso88591";
        no surprises
LANG="en_US.iso885915";
        no surprises
LANG="en_ZA.iso88591";
        no surprises
LANG="en_ZW.iso88591";
        no surprises
LANG="es_AR.iso88591";
        no surprises
LANG="es_BO.iso88591";
        no surprises
LANG="es_CL.iso88591";
        no surprises
LANG="es_CO.iso88591";
        no surprises
LANG="es_CR.iso88591";
        no surprises
LANG="es_DO.iso88591";
        no surprises
LANG="es_EC.iso88591";
        no surprises
LANG="es_ES.iso88591";
        no surprises
LANG="es_ES.iso885915@euro";
        no surprises
LANG="es_GT.iso88591";
        no surprises
LANG="es_HN.iso88591";
        no surprises
LANG="es_MX.iso88591";
        no surprises
LANG="es_NI.iso88591";
        no surprises
LANG="es_PA.iso88591";
        no surprises
LANG="es_PE.iso88591";
        no surprises
LANG="es_PR.iso88591";
        no surprises
LANG="es_PY.iso88591";
        no surprises
LANG="es_SV.iso88591";
        no surprises
LANG="es_US.iso88591";
        no surprises
LANG="es_UY.iso88591";
        no surprises
LANG="es_VE.iso88591";
        no surprises
LANG="et_EE.iso88591";
        no surprises
LANG="et_EE.iso885915";
        no surprises
LANG="eu_ES.iso88591";
        no surprises
LANG="eu_ES.iso885915@euro";
        no surprises
LANG="fi_FI.iso88591";
        no surprises
LANG="fi_FI.iso885915@euro";
        no surprises
LANG="fo_FO.iso88591";
        no surprises
LANG="fr_BE.iso88591";
        no surprises
LANG="fr_BE.iso885915@euro";
        no surprises
LANG="fr_CA.iso88591";
        no surprises
LANG="fr_CH.iso88591";
        no surprises
LANG="fr_FR.iso88591";
        no surprises
LANG="fr_FR.iso885915@euro";
        no surprises
LANG="fr_LU.iso88591";
        no surprises
LANG="fr_LU.iso885915@euro";
        no surprises
LANG="ga_IE.iso88591";
        no surprises
LANG="ga_IE.iso885915@euro";
        no surprises
LANG="gd_GB.iso885915";
        no surprises
LANG="gl_ES.iso88591";
        no surprises
LANG="gl_ES.iso885915@euro";
        no surprises
LANG="gv_GB.iso88591";
        no surprises
LANG="he_IL.iso88598";
        no surprises
LANG="hr_HR.iso88592";
        no surprises
LANG="hsb_DE.iso88592";
        no surprises
LANG="hu_HU.iso88592";
        no surprises
LANG="hy_AM.armscii8";
        no surprises
LANG="id_ID.iso88591";
        no surprises
LANG="is_IS.iso88591";
        no surprises
LANG="it_CH.iso88591";
        no surprises
LANG="it_IT.iso88591";
        no surprises
LANG="it_IT.iso885915@euro";
        no surprises
LANG="iw_IL.iso88598";
        no surprises
LANG="ka_GE.georgianps";
        no surprises
LANG="kk_KZ.pt154";
        no surprises
LANG="kl_GL.iso88591";
        no surprises
LANG="ku_TR.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="kw_GB.iso88591";
        no surprises
LANG="lg_UG.iso885910";
        no surprises
LANG="lt_LT.iso885913";
        no surprises
LANG="lv_LV.iso885913";
        no surprises
LANG="mg_MG.iso885915";
        no surprises
LANG="mi_NZ.iso885913";
        no surprises
LANG="mk_MK.iso88595";
        no surprises
LANG="ms_MY.iso88591";
        no surprises
LANG="mt_MT.iso88593";
  128 =                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°ħ²³´µĥ·¸ışğĵ½ ż
      ^                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°Ħ²³´µĤ·¸IŞĞĴ½ Ż
      v                                  ħ˘£¤ ĥ§¨işğĵ­ ż°ħ²³´µĥ·¸ışğĵ½ ż
      ?                                  .    .  *...  . '    '  *'''  '
LANG="nb_NO.iso88591";
        no surprises
LANG="nl_BE.iso88591";
        no surprises
LANG="nl_BE.iso885915@euro";
        no surprises
LANG="nl_NL.iso88591";
        no surprises
LANG="nl_NL.iso885915@euro";
        no surprises
LANG="nn_NO.iso88591";
        no surprises
LANG="(null)";
        no surprises
LANG="oc_FR.iso88591";
        no surprises
LANG="om_KE.iso88591";
        no surprises
LANG="pl_PL.iso88592";
        no surprises
LANG="pt_BR.iso88591";
        no surprises
LANG="pt_PT.iso88591";
        no surprises
LANG="pt_PT.iso885915@euro";
        no surprises
LANG="ro_RO.iso88592";
        no surprises
LANG="ru_RU.iso88595";
        no surprises
LANG="ru_RU.koi8r";
        no surprises
LANG="ru_UA.koi8u";
        no surprises
LANG="sk_SK.iso88592";
        no surprises
LANG="sl_SI.iso88592";
        no surprises
LANG="so_DJ.iso88591";
        no surprises
LANG="so_KE.iso88591";
        no surprises
LANG="so_SO.iso88591";
        no surprises
LANG="sq_AL.iso88591";
        no surprises
LANG="st_ZA.iso88591";
        no surprises
LANG="sv_FI.iso88591";
        no surprises
LANG="sv_FI.iso885915@euro";
        no surprises
LANG="sv_SE.iso88591";
        no surprises
LANG="sv_SE.iso885915";
        no surprises
LANG="tg_TJ.koi8t";
        no surprises
LANG="th_TH.tis620";
        no surprises
LANG="tl_PH.iso88591";
        no surprises
LANG="tr_CY.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="tr_TR.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="uk_UA.koi8u";
        no surprises
LANG="uz_UZ.iso88591";
        no surprises
LANG="wa_BE.iso88591";
        no surprises
LANG="wa_BE.iso885915@euro";
        no surprises
LANG="xh_ZA.iso88591";
        no surprises
LANG="yi_US.cp1255";
        no surprises
LANG="zu_ZA.iso88591";
        no surprises
LANG="aa_DJ.utf8";
        no surprises
LANG="aa_ER.utf8";
        no surprises
LANG="aa_ER.utf8@saaho";
        no surprises
LANG="aa_ET.utf8";
        no surprises
LANG="af_ZA.utf8";
        no surprises
LANG="ak_GH.utf8";
        no surprises
LANG="am_ET.utf8";
        no surprises
LANG="an_ES.utf8";
        no surprises
LANG="anp_IN.utf8";
        no surprises
LANG="ar_AE.utf8";
        no surprises
LANG="ar_BH.utf8";
        no surprises
LANG="ar_DZ.utf8";
        no surprises
LANG="ar_EG.utf8";
        no surprises
LANG="ar_IN.utf8";
        no surprises
LANG="ar_IQ.utf8";
        no surprises
LANG="ar_JO.utf8";
        no surprises
LANG="ar_KW.utf8";
        no surprises
LANG="ar_LB.utf8";
        no surprises
LANG="ar_LY.utf8";
        no surprises
LANG="ar_MA.utf8";
        no surprises
LANG="ar_OM.utf8";
        no surprises
LANG="ar_QA.utf8";
        no surprises
LANG="ar_SA.utf8";
        no surprises
LANG="ar_SD.utf8";
        no surprises
LANG="ar_SS.utf8";
        no surprises
LANG="ar_SY.utf8";
        no surprises
LANG="ar_TN.utf8";
        no surprises
LANG="ar_YE.utf8";
        no surprises
LANG="as_IN.utf8";
        no surprises
LANG="ast_ES.utf8";
        no surprises
LANG="ayc_PE.utf8";
        no surprises
LANG="az_AZ.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="be_BY.utf8";
        no surprises
LANG="be_BY.utf8@latin";
        no surprises
LANG="bem_ZM.utf8";
        no surprises
LANG="ber_DZ.utf8";
        no surprises
LANG="ber_MA.utf8";
        no surprises
LANG="bg_BG.utf8";
        no surprises
LANG="bh_IN.utf8";
        no surprises
LANG="bho_IN.utf8";
        no surprises
LANG="bn_BD.utf8";
        no surprises
LANG="bn_IN.utf8";
        no surprises
LANG="bo_CN.utf8";
        no surprises
LANG="bo_IN.utf8";
        no surprises
LANG="br_FR.utf8";
        no surprises
LANG="brx_IN.utf8";
        no surprises
LANG="bs_BA.utf8";
        no surprises
LANG="byn_ER.utf8";
        no surprises
LANG="ca_AD.utf8";
        no surprises
LANG="ca_ES.utf8";
        no surprises
LANG="ca_FR.utf8";
        no surprises
LANG="ca_IT.utf8";
        no surprises
LANG="ce_RU.utf8";
        no surprises
LANG="cmn_TW.utf8";
        no surprises
LANG="crh_UA.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="csb_PL.utf8";
        no surprises
LANG="cs_CZ.utf8";
        no surprises
LANG="cv_RU.utf8";
        no surprises
LANG="cy_GB.utf8";
        no surprises
LANG="da_DK.utf8";
        no surprises
LANG="de_AT.utf8";
        no surprises
LANG="de_BE.utf8";
        no surprises
LANG="de_CH.utf8";
        no surprises
LANG="de_DE.utf8";
        no surprises
LANG="de_LU.utf8";
        no surprises
LANG="doi_IN.utf8";
        no surprises
LANG="dv_MV.utf8";
        no surprises
LANG="dz_BT.utf8";
        no surprises
LANG="el_CY.utf8";
        no surprises
LANG="el_GR.utf8";
        no surprises
LANG="en_AG.utf8";
        no surprises
LANG="en_AU.utf8";
        no surprises
LANG="en_BW.utf8";
        no surprises
LANG="en_CA.utf8";
        no surprises
LANG="en_DK.utf8";
        no surprises
LANG="en_GB.utf8";
        no surprises
LANG="en_HK.utf8";
        no surprises
LANG="en_IE.utf8";
        no surprises
LANG="en_IN.utf8";
        no surprises
LANG="en_NG.utf8";
        no surprises
LANG="en_NZ.utf8";
        no surprises
LANG="en_PH.utf8";
        no surprises
LANG="en_SG.utf8";
        no surprises
LANG="en_US.utf8";
        no surprises
LANG="en_ZA.utf8";
        no surprises
LANG="en_ZM.utf8";
        no surprises
LANG="en_ZW.utf8";
        no surprises
LANG="es_AR.utf8";
        no surprises
LANG="es_BO.utf8";
        no surprises
LANG="es_CL.utf8";
        no surprises
LANG="es_CO.utf8";
        no surprises
LANG="es_CR.utf8";
        no surprises
LANG="es_CU.utf8";
        no surprises
LANG="es_DO.utf8";
        no surprises
LANG="es_EC.utf8";
        no surprises
LANG="es_ES.utf8";
        no surprises
LANG="es_GT.utf8";
        no surprises
LANG="es_HN.utf8";
        no surprises
LANG="es_MX.utf8";
        no surprises
LANG="es_NI.utf8";
        no surprises
LANG="es_PA.utf8";
        no surprises
LANG="es_PE.utf8";
        no surprises
LANG="es_PR.utf8";
        no surprises
LANG="es_PY.utf8";
        no surprises
LANG="es_SV.utf8";
        no surprises
LANG="es_US.utf8";
        no surprises
LANG="es_UY.utf8";
        no surprises
LANG="es_VE.utf8";
        no surprises
LANG="et_EE.utf8";
        no surprises
LANG="eu_ES.utf8";
        no surprises
LANG="fa_IR.utf8";
        no surprises
LANG="ff_SN.utf8";
        no surprises
LANG="fi_FI.utf8";
        no surprises
LANG="fil_PH.utf8";
        no surprises
LANG="fo_FO.utf8";
        no surprises
LANG="fr_BE.utf8";
        no surprises
LANG="fr_CA.utf8";
        no surprises
LANG="fr_CH.utf8";
        no surprises
LANG="fr_FR.utf8";
        no surprises
LANG="fr_LU.utf8";
        no surprises
LANG="fur_IT.utf8";
        no surprises
LANG="fy_DE.utf8";
        no surprises
LANG="fy_NL.utf8";
        no surprises
LANG="ga_IE.utf8";
        no surprises
LANG="gd_GB.utf8";
        no surprises
LANG="gez_ER.utf8";
        no surprises
LANG="gez_ER.utf8@abegede";
        no surprises
LANG="gez_ET.utf8";
        no surprises
LANG="gez_ET.utf8@abegede";
        no surprises
LANG="gl_ES.utf8";
        no surprises
LANG="gu_IN.utf8";
        no surprises
LANG="gv_GB.utf8";
        no surprises
LANG="hak_TW.utf8";
        no surprises
LANG="ha_NG.utf8";
        no surprises
LANG="he_IL.utf8";
        no surprises
LANG="hi_IN.utf8";
        no surprises
LANG="hne_IN.utf8";
        no surprises
LANG="hr_HR.utf8";
        no surprises
LANG="hsb_DE.utf8";
        no surprises
LANG="ht_HT.utf8";
        no surprises
LANG="hu_HU.utf8";
        no surprises
LANG="hy_AM.utf8";
        no surprises
LANG="ia_FR.utf8";
        no surprises
LANG="id_ID.utf8";
        no surprises
LANG="ig_NG.utf8";
        no surprises
LANG="ik_CA.utf8";
        no surprises
LANG="is_IS.utf8";
        no surprises
LANG="it_CH.utf8";
        no surprises
LANG="it_IT.utf8";
        no surprises
LANG="iu_CA.utf8";
        no surprises
LANG="iw_IL.utf8";
        no surprises
LANG="ja_JP.utf8";
        no surprises
LANG="ka_GE.utf8";
        no surprises
LANG="kk_KZ.utf8";
        no surprises
LANG="kl_GL.utf8";
        no surprises
LANG="km_KH.utf8";
        no surprises
LANG="kn_IN.utf8";
        no surprises
LANG="kok_IN.utf8";
        no surprises
LANG="ko_KR.utf8";
        no surprises
LANG="ks_IN.utf8";
        no surprises
LANG="ks_IN.utf8@devanagari";
        no surprises
LANG="ku_TR.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="kw_GB.utf8";
        no surprises
LANG="ky_KG.utf8";
        no surprises
LANG="lb_LU.utf8";
        no surprises
LANG="lg_UG.utf8";
        no surprises
LANG="li_BE.utf8";
        no surprises
LANG="lij_IT.utf8";
        no surprises
LANG="li_NL.utf8";
        no surprises
LANG="lo_LA.utf8";
        no surprises
LANG="lt_LT.utf8";
        no surprises
LANG="lv_LV.utf8";
        no surprises
LANG="lzh_TW.utf8";
        no surprises
LANG="mag_IN.utf8";
        no surprises
LANG="mai_IN.utf8";
        no surprises
LANG="mg_MG.utf8";
        no surprises
LANG="mhr_RU.utf8";
        no surprises
LANG="mi_NZ.utf8";
        no surprises
LANG="mk_MK.utf8";
        no surprises
LANG="ml_IN.utf8";
        no surprises
LANG="mni_IN.utf8";
        no surprises
LANG="mn_MN.utf8";
        no surprises
LANG="mr_IN.utf8";
        no surprises
LANG="ms_MY.utf8";
        no surprises
LANG="mt_MT.utf8";
        no surprises
LANG="my_MM.utf8";
        no surprises
LANG="nan_TW.utf8";
        no surprises
LANG="nan_TW.utf8@latin";
        no surprises
LANG="nb_NO.utf8";
        no surprises
LANG="nds_DE.utf8";
        no surprises
LANG="nds_NL.utf8";
        no surprises
LANG="ne_NP.utf8";
        no surprises
LANG="nhn_MX.utf8";
        no surprises
LANG="niu_NU.utf8";
        no surprises
LANG="niu_NZ.utf8";
        no surprises
LANG="nl_AW.utf8";
        no surprises
LANG="nl_BE.utf8";
        no surprises
LANG="nl_NL.utf8";
        no surprises
LANG="nn_NO.utf8";
        no surprises
LANG="nr_ZA.utf8";
        no surprises
LANG="nso_ZA.utf8";
        no surprises
LANG="oc_FR.utf8";
        no surprises
LANG="om_ET.utf8";
        no surprises
LANG="om_KE.utf8";
        no surprises
LANG="or_IN.utf8";
        no surprises
LANG="os_RU.utf8";
        no surprises
LANG="pa_IN.utf8";
        no surprises
LANG="pap_AN.utf8";
        no surprises
LANG="pap_AW.utf8";
        no surprises
LANG="pap_CW.utf8";
        no surprises
LANG="pa_PK.utf8";
        no surprises
LANG="pl_PL.utf8";
        no surprises
LANG="ps_AF.utf8";
        no surprises
LANG="pt_BR.utf8";
        no surprises
LANG="pt_PT.utf8";
        no surprises
LANG="quz_PE.utf8";
        no surprises
LANG="raj_IN.utf8";
        no surprises
LANG="ro_RO.utf8";
        no surprises
LANG="ru_RU.utf8";
        no surprises
LANG="ru_UA.utf8";
        no surprises
LANG="rw_RW.utf8";
        no surprises
LANG="sa_IN.utf8";
        no surprises
LANG="sat_IN.utf8";
        no surprises
LANG="sc_IT.utf8";
        no surprises
LANG="sd_IN.utf8";
        no surprises
LANG="sd_IN.utf8@devanagari";
        no surprises
LANG="se_NO.utf8";
        no surprises
LANG="shs_CA.utf8";
        no surprises
LANG="sid_ET.utf8";
        no surprises
LANG="si_LK.utf8";
        no surprises
LANG="sk_SK.utf8";
        no surprises
LANG="sl_SI.utf8";
        no surprises
LANG="so_DJ.utf8";
        no surprises
LANG="so_ET.utf8";
        no surprises
LANG="so_KE.utf8";
        no surprises
LANG="so_SO.utf8";
        no surprises
LANG="sq_AL.utf8";
        no surprises
LANG="sq_MK.utf8";
        no surprises
LANG="sr_ME.utf8";
        no surprises
LANG="sr_RS.utf8";
        no surprises
LANG="sr_RS.utf8@latin";
        no surprises
LANG="ss_ZA.utf8";
        no surprises
LANG="st_ZA.utf8";
        no surprises
LANG="sv_FI.utf8";
        no surprises
LANG="sv_SE.utf8";
        no surprises
LANG="sw_KE.utf8";
        no surprises
LANG="sw_TZ.utf8";
        no surprises
LANG="szl_PL.utf8";
        no surprises
LANG="ta_IN.utf8";
        no surprises
LANG="ta_LK.utf8";
        no surprises
LANG="te_IN.utf8";
        no surprises
LANG="tg_TJ.utf8";
        no surprises
LANG="the_NP.utf8";
        no surprises
LANG="th_TH.utf8";
        no surprises
LANG="ti_ER.utf8";
        no surprises
LANG="ti_ET.utf8";
        no surprises
LANG="tig_ER.utf8";
        no surprises
LANG="tk_TM.utf8";
        no surprises
LANG="tl_PH.utf8";
        no surprises
LANG="tn_ZA.utf8";
        no surprises
LANG="tr_CY.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="tr_TR.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="ts_ZA.utf8";
        no surprises
LANG="tt_RU.utf8";
        no surprises
LANG="tt_RU.utf8@iqtelif";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="tu_IN.utf8";
        no surprises
LANG="ug_CN.utf8";
        no surprises
LANG="uk_UA.utf8";
        no surprises
LANG="unm_US.utf8";
        no surprises
LANG="ur_IN.utf8";
        no surprises
LANG="ur_PK.utf8";
        no surprises
LANG="uz_UZ.utf8";
        no surprises
LANG="uz_UZ.utf8@cyrillic";
        no surprises
LANG="ve_ZA.utf8";
        no surprises
LANG="vi_VN.utf8";
        no surprises
LANG="wa_BE.utf8";
        no surprises
LANG="wae_CH.utf8";
        no surprises
LANG="wal_ET.utf8";
        no surprises
LANG="wo_SN.utf8";
        no surprises
LANG="xh_ZA.utf8";
        no surprises
LANG="yi_US.utf8";
        no surprises
LANG="yo_NG.utf8";
        no surprises
LANG="yue_HK.utf8";
        no surprises
LANG="zh_CN.utf8";
        no surprises
LANG="zh_HK.utf8";
        no surprises
LANG="zh_SG.utf8";
        no surprises
LANG="zh_TW.utf8";
        no surprises
LANG="zu_ZA.utf8";
        no surprises

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 9:44 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> LANG="ku_TR.iso88599";
>    64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>       ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
>       v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>       ?  ........*.................      ''''''''*'''''''''''''''''
>   192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
>       ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
>       v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
>       ? ....................... .....*. ''''''''''''''''''''''' '''''*'
>

The translation here is pretty simple.  We display the ^ toupper() and the
v tolower() value of every character.  For the summary line '?', in normal
or -v verbose mode, ' ' means no translation at all, '.' means this ch
has a lower case translation, and ''' means the ch has an upper case
translation, but I strip most of these lines out while searching for the
exceptional cases...

'*' is the surprising case: a high bit character translates into the
ancient 0-127 code plane, a ch in 0-127 translates into the high bit plane,
or anything within the traditional 0-127 code plane translates into an
unexpected position.
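
The summary symbols described above can be sketched roughly as follows.
This is a hypothetical reconstruction, not the attached source: the names
summarize() and summarize_current_locale() are made up for illustration,
and the sketch assumes an ASCII-based single-byte charset.

```c
#include <ctype.h>

/* Plain ASCII folding, used as the "expected" result for bytes 0-127. */
static int ascii_upper(int ch)
{
    return (ch >= 'a' && ch <= 'z') ? ch - ('a' - 'A') : ch;
}

static int ascii_lower(int ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;
}

/* ch is a byte value 0..255; up and lo are its toupper()/tolower()
 * results.  Returns the symbol for the '?' summary line. */
char summarize(int ch, int up, int lo)
{
    if (ch < 128) {
        /* Anything in 0-127 must fold exactly as plain ASCII does;
         * this also catches a result that lands in the high bit plane. */
        if (up != ascii_upper(ch) || lo != ascii_lower(ch))
            return '*';
    } else if (up < 128 || lo < 128) {
        /* A high bit character folded down into the 0-127 plane. */
        return '*';
    }
    if (lo != ch)
        return '.';     /* has a lower case translation */
    if (up != ch)
        return '\'';    /* has an upper case translation */
    return ' ';         /* no translation at all */
}

/* Convenience wrapper that asks the current LC_CTYPE locale. */
char summarize_current_locale(unsigned char ch)
{
    return summarize(ch, toupper(ch), tolower(ch));
}
```

Passing the toupper()/tolower() results in as arguments keeps the rule
itself independent of whichever locales happen to be installed.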

LANG="mt_MT.iso88593";
  128 =                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°ħ²³´µĥ·¸ışğĵ½ ż
      ^                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°Ħ²³´µĤ·¸IŞĞĴ½ Ż
      v                                  ħ˘£¤ ĥ§¨işğĵ­ ż°ħ²³´µĥ·¸ışğĵ½ ż
      ?                                  .    .  *...  . '    '  *'''  '

The last example above seems to indicate an isprint() validation error
or utf-8 mis-assignment in iconv, somewhere in the last 16 characters
of this code table, apparently between Ĵ­ and Ż :)
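
For protocol tokens the '*' rows above are the practical hazard: once a
Turkish-style locale is active, a comparison built on tolower()/toupper()
no longer matches "TITLE" with "title".  A minimal sketch of a
locale-independent comparison follows, assuming an ASCII platform.  The
name token_casecmp_sketch() is hypothetical; this is not the APR function
under discussion, only an illustration of folding A-Z and nothing else.

```c
/* Fold only A-Z; every other byte, including all high bit bytes,
 * keeps its identity.  Assumes an ASCII-based charset. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Returns <0, 0 or >0 like strcasecmp(), but never consults the locale. */
int token_casecmp_sketch(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    for (;; ++p, ++q) {
        int d = fold_ascii(*p) - fold_ascii(*q);
        if (d != 0 || *p == '\0')
            return d;
    }
}
```

Because the fold never calls into the C library, the result is the same
before and after any setlocale() call.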

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 6:45 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Nov 25, 2015 4:19 PM, "Mikhail T." <mi...@aldan.algebra.com> wrote:
> >
> > Thus, I contend, using C-library will not cause invalid results, and the
> > only reason to have Apache's own implementation is performance, but not
> > correctness.
>
> Well almost but wrong...
>
> The pure char-based ß processing produced no case change in my reviews of
> tolower/toupper in de_DE codeset. If you were to examine string comparison
> the collation order changes substantially.
>
> And more to the point, if tolower()/toupper() could handle not only mbcs
> but multicharacter transliteration, your results would have varied.  1:1
> character translations have their intrinsic limits.
>
> That said, I'm working up a comprehensive audit and other codeset/language
> combinations absolutely do.  Code and results forthcoming shortly.
>
As promised, here's a quick review of the sbcs and utf8 code pages installed
on my machine, limited to the single-byte scope.

I did not touch the mbcs locales listed below as untested, because they
require 'shift-state' to toggle into and out of specific characters, and
that implies a lot of calculated fuzzing that I didn't have time for this
week.  (Since mod_ftp explicit tls is still broken, I had no time for any of
this, either ;-)  I also haven't yet evaluated the wide chars that fall into
the traditional posix/c ascii range, which I still mean to do, and haven't
yet repeated this exercise on win32 or os/x, only on a somewhat
multinational configuration of fedora 22.

The source code is pretty rudimentary.  I used iconv to convert all of the
resulting text into utf-8 for the console/file output; it plays no part in
the locale equation itself.  It could be adapted for similar testing on an
EBCDIC box with a bit of clever coding that I never got to.
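
For readers without the attachment, the core of such an audit might look
roughly like the sketch below.  It is an assumption-laden illustration, not
the attached program: count_surprises() is a hypothetical name, the iconv
display step is left out entirely, and an ASCII-based charset is assumed.

```c
#include <ctype.h>
#include <locale.h>

/* Counts the byte values 0..255 whose toupper()/tolower() result is
 * "surprising" under the named locale: a 0-127 byte that does not fold
 * as plain ASCII, or a high bit byte that folds into 0-127.
 * Returns -1 if the locale cannot be selected. */
int count_surprises(const char *locale_name)
{
    int ch;
    int count = 0;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    for (ch = 0; ch < 256; ++ch) {
        int up = toupper(ch);
        int lo = tolower(ch);

        if (ch < 128) {
            int want_up = (ch >= 'a' && ch <= 'z') ? ch - ('a' - 'A') : ch;
            int want_lo = (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;
            if (up != want_up || lo != want_lo)
                ++count;
        } else if (up < 128 || lo < 128) {
            ++count;
        }
    }

    /* Simplification: return to the startup "C" locale rather than
     * restoring whatever was active before the call. */
    setlocale(LC_CTYPE, "C");
    return count;
}
```

Calling it once per name reported by `locale -a` would give one number per
locale; which names exist, and what they report, depends on the machine.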

Untested: ja_JP.eucjp ja_JP.ujis japanese.euc ko_KR.euckr korean.euc
zh_CN.gb18030 zh_CN.gb2312 zh_CN.gbk zh_HK.big5hkscs zh_SG.gb2312 zh_SG.gbk
zh_TW.big5 zh_TW.euctw

Tested, with exceptional results noted (source code attached):

LANG="aa_DJ.iso88591";
        no surprises
LANG="af_ZA.iso88591";
        no surprises
LANG="an_ES.iso885915";
        no surprises
LANG="ar_AE.iso88596";
        no surprises
LANG="ar_BH.iso88596";
        no surprises
LANG="ar_DZ.iso88596";
        no surprises
LANG="ar_EG.iso88596";
        no surprises
LANG="ar_IQ.iso88596";
        no surprises
LANG="ar_JO.iso88596";
        no surprises
LANG="ar_KW.iso88596";
        no surprises
LANG="ar_LB.iso88596";
        no surprises
LANG="ar_LY.iso88596";
        no surprises
LANG="ar_MA.iso88596";
        no surprises
LANG="ar_OM.iso88596";
        no surprises
LANG="ar_QA.iso88596";
        no surprises
LANG="ar_SA.iso88596";
        no surprises
LANG="ar_SD.iso88596";
        no surprises
LANG="ar_SY.iso88596";
        no surprises
LANG="ar_TN.iso88596";
        no surprises
LANG="ar_YE.iso88596";
        no surprises
LANG="ast_ES.iso885915";
        no surprises
LANG="be_BY.cp1251";
        no surprises
LANG="bg_BG.cp1251";
        no surprises
LANG="br_FR.iso88591";
        no surprises
LANG="br_FR.iso885915@euro";
        no surprises
LANG="bs_BA.iso88592";
        no surprises
LANG="ca_AD.iso885915";
        no surprises
LANG="ca_ES.iso88591";
        no surprises
LANG="ca_ES.iso885915@euro";
        no surprises
LANG="ca_FR.iso885915";
        no surprises
LANG="ca_IT.iso885915";
        no surprises
LANG="cs_CZ.iso88592";
        no surprises
LANG="cy_GB.iso885914";
        no surprises
LANG="da_DK.iso88591";
        no surprises
LANG="da_DK.iso885915";
        no surprises
LANG="de_AT.iso88591";
        no surprises
LANG="de_AT.iso885915@euro";
        no surprises
LANG="de_BE.iso88591";
        no surprises
LANG="de_BE.iso885915@euro";
        no surprises
LANG="de_CH.iso88591";
        no surprises
LANG="de_DE.iso88591";
        no surprises
LANG="de_DE.iso885915@euro";
        no surprises
LANG="de_LU.iso88591";
        no surprises
LANG="de_LU.iso885915@euro";
        no surprises
LANG="el_CY.iso88597";
        no surprises
LANG="el_GR.iso88597";
        no surprises
LANG="en_AU.iso88591";
        no surprises
LANG="en_BW.iso88591";
        no surprises
LANG="en_CA.iso88591";
        no surprises
LANG="en_DK.iso88591";
        no surprises
LANG="en_GB.iso88591";
        no surprises
LANG="en_GB.iso885915";
        no surprises
LANG="en_HK.iso88591";
        no surprises
LANG="en_IE.iso88591";
        no surprises
LANG="en_IE.iso885915@euro";
        no surprises
LANG="en_NZ.iso88591";
        no surprises
LANG="en_PH.iso88591";
        no surprises
LANG="en_SG.iso88591";
        no surprises
LANG="en_US.iso88591";
        no surprises
LANG="en_US.iso885915";
        no surprises
LANG="en_ZA.iso88591";
        no surprises
LANG="en_ZW.iso88591";
        no surprises
LANG="es_AR.iso88591";
        no surprises
LANG="es_BO.iso88591";
        no surprises
LANG="es_CL.iso88591";
        no surprises
LANG="es_CO.iso88591";
        no surprises
LANG="es_CR.iso88591";
        no surprises
LANG="es_DO.iso88591";
        no surprises
LANG="es_EC.iso88591";
        no surprises
LANG="es_ES.iso88591";
        no surprises
LANG="es_ES.iso885915@euro";
        no surprises
LANG="es_GT.iso88591";
        no surprises
LANG="es_HN.iso88591";
        no surprises
LANG="es_MX.iso88591";
        no surprises
LANG="es_NI.iso88591";
        no surprises
LANG="es_PA.iso88591";
        no surprises
LANG="es_PE.iso88591";
        no surprises
LANG="es_PR.iso88591";
        no surprises
LANG="es_PY.iso88591";
        no surprises
LANG="es_SV.iso88591";
        no surprises
LANG="es_US.iso88591";
        no surprises
LANG="es_UY.iso88591";
        no surprises
LANG="es_VE.iso88591";
        no surprises
LANG="et_EE.iso88591";
        no surprises
LANG="et_EE.iso885915";
        no surprises
LANG="eu_ES.iso88591";
        no surprises
LANG="eu_ES.iso885915@euro";
        no surprises
LANG="fi_FI.iso88591";
        no surprises
LANG="fi_FI.iso885915@euro";
        no surprises
LANG="fo_FO.iso88591";
        no surprises
LANG="fr_BE.iso88591";
        no surprises
LANG="fr_BE.iso885915@euro";
        no surprises
LANG="fr_CA.iso88591";
        no surprises
LANG="fr_CH.iso88591";
        no surprises
LANG="fr_FR.iso88591";
        no surprises
LANG="fr_FR.iso885915@euro";
        no surprises
LANG="fr_LU.iso88591";
        no surprises
LANG="fr_LU.iso885915@euro";
        no surprises
LANG="ga_IE.iso88591";
        no surprises
LANG="ga_IE.iso885915@euro";
        no surprises
LANG="gd_GB.iso885915";
        no surprises
LANG="gl_ES.iso88591";
        no surprises
LANG="gl_ES.iso885915@euro";
        no surprises
LANG="gv_GB.iso88591";
        no surprises
LANG="he_IL.iso88598";
        no surprises
LANG="hr_HR.iso88592";
        no surprises
LANG="hsb_DE.iso88592";
        no surprises
LANG="hu_HU.iso88592";
        no surprises
LANG="hy_AM.armscii8";
        no surprises
LANG="id_ID.iso88591";
        no surprises
LANG="is_IS.iso88591";
        no surprises
LANG="it_CH.iso88591";
        no surprises
LANG="it_IT.iso88591";
        no surprises
LANG="it_IT.iso885915@euro";
        no surprises
LANG="iw_IL.iso88598";
        no surprises
LANG="ka_GE.georgianps";
        no surprises
LANG="kk_KZ.pt154";
        no surprises
LANG="kl_GL.iso88591";
        no surprises
LANG="ku_TR.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="kw_GB.iso88591";
        no surprises
LANG="lg_UG.iso885910";
        no surprises
LANG="lt_LT.iso885913";
        no surprises
LANG="lv_LV.iso885913";
        no surprises
LANG="mg_MG.iso885915";
        no surprises
LANG="mi_NZ.iso885913";
        no surprises
LANG="mk_MK.iso88595";
        no surprises
LANG="ms_MY.iso88591";
        no surprises
LANG="mt_MT.iso88593";
  128 =                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°ħ²³´µĥ·¸ışğĵ½ ż
      ^                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°Ħ²³´µĤ·¸IŞĞĴ½ Ż
      v                                  ħ˘£¤ ĥ§¨işğĵ­ ż°ħ²³´µĥ·¸ışğĵ½ ż
      ?                                  .    .  *...  . '    '  *'''  '
LANG="nb_NO.iso88591";
        no surprises
LANG="nl_BE.iso88591";
        no surprises
LANG="nl_BE.iso885915@euro";
        no surprises
LANG="nl_NL.iso88591";
        no surprises
LANG="nl_NL.iso885915@euro";
        no surprises
LANG="nn_NO.iso88591";
        no surprises
LANG="(null)";
        no surprises
LANG="oc_FR.iso88591";
        no surprises
LANG="om_KE.iso88591";
        no surprises
LANG="pl_PL.iso88592";
        no surprises
LANG="pt_BR.iso88591";
        no surprises
LANG="pt_PT.iso88591";
        no surprises
LANG="pt_PT.iso885915@euro";
        no surprises
LANG="ro_RO.iso88592";
        no surprises
LANG="ru_RU.iso88595";
        no surprises
LANG="ru_RU.koi8r";
        no surprises
LANG="ru_UA.koi8u";
        no surprises
LANG="sk_SK.iso88592";
        no surprises
LANG="sl_SI.iso88592";
        no surprises
LANG="so_DJ.iso88591";
        no surprises
LANG="so_KE.iso88591";
        no surprises
LANG="so_SO.iso88591";
        no surprises
LANG="sq_AL.iso88591";
        no surprises
LANG="st_ZA.iso88591";
        no surprises
LANG="sv_FI.iso88591";
        no surprises
LANG="sv_FI.iso885915@euro";
        no surprises
LANG="sv_SE.iso88591";
        no surprises
LANG="sv_SE.iso885915";
        no surprises
LANG="tg_TJ.koi8t";
        no surprises
LANG="th_TH.tis620";
        no surprises
LANG="tl_PH.iso88591";
        no surprises
LANG="tr_CY.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="tr_TR.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="uk_UA.koi8u";
        no surprises
LANG="uz_UZ.iso88591";
        no surprises
LANG="wa_BE.iso88591";
        no surprises
LANG="wa_BE.iso885915@euro";
        no surprises
LANG="xh_ZA.iso88591";
        no surprises
LANG="yi_US.cp1255";
        no surprises
LANG="zu_ZA.iso88591";
        no surprises
LANG="aa_DJ.utf8";
        no surprises
LANG="aa_ER.utf8";
        no surprises
LANG="aa_ER.utf8@saaho";
        no surprises
LANG="aa_ET.utf8";
        no surprises
LANG="af_ZA.utf8";
        no surprises
LANG="ak_GH.utf8";
        no surprises
LANG="am_ET.utf8";
        no surprises
LANG="an_ES.utf8";
        no surprises
LANG="anp_IN.utf8";
        no surprises
LANG="ar_AE.utf8";
        no surprises
LANG="ar_BH.utf8";
        no surprises
LANG="ar_DZ.utf8";
        no surprises
LANG="ar_EG.utf8";
        no surprises
LANG="ar_IN.utf8";
        no surprises
LANG="ar_IQ.utf8";
        no surprises
LANG="ar_JO.utf8";
        no surprises
LANG="ar_KW.utf8";
        no surprises
LANG="ar_LB.utf8";
        no surprises
LANG="ar_LY.utf8";
        no surprises
LANG="ar_MA.utf8";
        no surprises
LANG="ar_OM.utf8";
        no surprises
LANG="ar_QA.utf8";
        no surprises
LANG="ar_SA.utf8";
        no surprises
LANG="ar_SD.utf8";
        no surprises
LANG="ar_SS.utf8";
        no surprises
LANG="ar_SY.utf8";
        no surprises
LANG="ar_TN.utf8";
        no surprises
LANG="ar_YE.utf8";
        no surprises
LANG="as_IN.utf8";
        no surprises
LANG="ast_ES.utf8";
        no surprises
LANG="ayc_PE.utf8";
        no surprises
LANG="az_AZ.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="be_BY.utf8";
        no surprises
LANG="be_BY.utf8@latin";
        no surprises
LANG="bem_ZM.utf8";
        no surprises
LANG="ber_DZ.utf8";
        no surprises
LANG="ber_MA.utf8";
        no surprises
LANG="bg_BG.utf8";
        no surprises
LANG="bh_IN.utf8";
        no surprises
LANG="bho_IN.utf8";
        no surprises
LANG="bn_BD.utf8";
        no surprises
LANG="bn_IN.utf8";
        no surprises
LANG="bo_CN.utf8";
        no surprises
LANG="bo_IN.utf8";
        no surprises
LANG="br_FR.utf8";
        no surprises
LANG="brx_IN.utf8";
        no surprises
LANG="bs_BA.utf8";
        no surprises
LANG="byn_ER.utf8";
        no surprises
LANG="ca_AD.utf8";
        no surprises
LANG="ca_ES.utf8";
        no surprises
LANG="ca_FR.utf8";
        no surprises
LANG="ca_IT.utf8";
        no surprises
LANG="ce_RU.utf8";
        no surprises
LANG="cmn_TW.utf8";
        no surprises
LANG="crh_UA.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="csb_PL.utf8";
        no surprises
LANG="cs_CZ.utf8";
        no surprises
LANG="cv_RU.utf8";
        no surprises
LANG="cy_GB.utf8";
        no surprises
LANG="da_DK.utf8";
        no surprises
LANG="de_AT.utf8";
        no surprises
LANG="de_BE.utf8";
        no surprises
LANG="de_CH.utf8";
        no surprises
LANG="de_DE.utf8";
        no surprises
LANG="de_LU.utf8";
        no surprises
LANG="doi_IN.utf8";
        no surprises
LANG="dv_MV.utf8";
        no surprises
LANG="dz_BT.utf8";
        no surprises
LANG="el_CY.utf8";
        no surprises
LANG="el_GR.utf8";
        no surprises
LANG="en_AG.utf8";
        no surprises
LANG="en_AU.utf8";
        no surprises
LANG="en_BW.utf8";
        no surprises
LANG="en_CA.utf8";
        no surprises
LANG="en_DK.utf8";
        no surprises
LANG="en_GB.utf8";
        no surprises
LANG="en_HK.utf8";
        no surprises
LANG="en_IE.utf8";
        no surprises
LANG="en_IN.utf8";
        no surprises
LANG="en_NG.utf8";
        no surprises
LANG="en_NZ.utf8";
        no surprises
LANG="en_PH.utf8";
        no surprises
LANG="en_SG.utf8";
        no surprises
LANG="en_US.utf8";
        no surprises
LANG="en_ZA.utf8";
        no surprises
LANG="en_ZM.utf8";
        no surprises
LANG="en_ZW.utf8";
        no surprises
LANG="es_AR.utf8";
        no surprises
LANG="es_BO.utf8";
        no surprises
LANG="es_CL.utf8";
        no surprises
LANG="es_CO.utf8";
        no surprises
LANG="es_CR.utf8";
        no surprises
LANG="es_CU.utf8";
        no surprises
LANG="es_DO.utf8";
        no surprises
LANG="es_EC.utf8";
        no surprises
LANG="es_ES.utf8";
        no surprises
LANG="es_GT.utf8";
        no surprises
LANG="es_HN.utf8";
        no surprises
LANG="es_MX.utf8";
        no surprises
LANG="es_NI.utf8";
        no surprises
LANG="es_PA.utf8";
        no surprises
LANG="es_PE.utf8";
        no surprises
LANG="es_PR.utf8";
        no surprises
LANG="es_PY.utf8";
        no surprises
LANG="es_SV.utf8";
        no surprises
LANG="es_US.utf8";
        no surprises
LANG="es_UY.utf8";
        no surprises
LANG="es_VE.utf8";
        no surprises
LANG="et_EE.utf8";
        no surprises
LANG="eu_ES.utf8";
        no surprises
LANG="fa_IR.utf8";
        no surprises
LANG="ff_SN.utf8";
        no surprises
LANG="fi_FI.utf8";
        no surprises
LANG="fil_PH.utf8";
        no surprises
LANG="fo_FO.utf8";
        no surprises
LANG="fr_BE.utf8";
        no surprises
LANG="fr_CA.utf8";
        no surprises
LANG="fr_CH.utf8";
        no surprises
LANG="fr_FR.utf8";
        no surprises
LANG="fr_LU.utf8";
        no surprises
LANG="fur_IT.utf8";
        no surprises
LANG="fy_DE.utf8";
        no surprises
LANG="fy_NL.utf8";
        no surprises
LANG="ga_IE.utf8";
        no surprises
LANG="gd_GB.utf8";
        no surprises
LANG="gez_ER.utf8";
        no surprises
LANG="gez_ER.utf8@abegede";
        no surprises
LANG="gez_ET.utf8";
        no surprises
LANG="gez_ET.utf8@abegede";
        no surprises
LANG="gl_ES.utf8";
        no surprises
LANG="gu_IN.utf8";
        no surprises
LANG="gv_GB.utf8";
        no surprises
LANG="hak_TW.utf8";
        no surprises
LANG="ha_NG.utf8";
        no surprises
LANG="he_IL.utf8";
        no surprises
LANG="hi_IN.utf8";
        no surprises
LANG="hne_IN.utf8";
        no surprises
LANG="hr_HR.utf8";
        no surprises
LANG="hsb_DE.utf8";
        no surprises
LANG="ht_HT.utf8";
        no surprises
LANG="hu_HU.utf8";
        no surprises
LANG="hy_AM.utf8";
        no surprises
LANG="ia_FR.utf8";
        no surprises
LANG="id_ID.utf8";
        no surprises
LANG="ig_NG.utf8";
        no surprises
LANG="ik_CA.utf8";
        no surprises
LANG="is_IS.utf8";
        no surprises
LANG="it_CH.utf8";
        no surprises
LANG="it_IT.utf8";
        no surprises
LANG="iu_CA.utf8";
        no surprises
LANG="iw_IL.utf8";
        no surprises
LANG="ja_JP.utf8";
        no surprises
LANG="ka_GE.utf8";
        no surprises
LANG="kk_KZ.utf8";
        no surprises
LANG="kl_GL.utf8";
        no surprises
LANG="km_KH.utf8";
        no surprises
LANG="kn_IN.utf8";
        no surprises
LANG="kok_IN.utf8";
        no surprises
LANG="ko_KR.utf8";
        no surprises
LANG="ks_IN.utf8";
        no surprises
LANG="ks_IN.utf8@devanagari";
        no surprises
LANG="ku_TR.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="kw_GB.utf8";
        no surprises
LANG="ky_KG.utf8";
        no surprises
LANG="lb_LU.utf8";
        no surprises
LANG="lg_UG.utf8";
        no surprises
LANG="li_BE.utf8";
        no surprises
LANG="lij_IT.utf8";
        no surprises
LANG="li_NL.utf8";
        no surprises
LANG="lo_LA.utf8";
        no surprises
LANG="lt_LT.utf8";
        no surprises
LANG="lv_LV.utf8";
        no surprises
LANG="lzh_TW.utf8";
        no surprises
LANG="mag_IN.utf8";
        no surprises
LANG="mai_IN.utf8";
        no surprises
LANG="mg_MG.utf8";
        no surprises
LANG="mhr_RU.utf8";
        no surprises
LANG="mi_NZ.utf8";
        no surprises
LANG="mk_MK.utf8";
        no surprises
LANG="ml_IN.utf8";
        no surprises
LANG="mni_IN.utf8";
        no surprises
LANG="mn_MN.utf8";
        no surprises
LANG="mr_IN.utf8";
        no surprises
LANG="ms_MY.utf8";
        no surprises
LANG="mt_MT.utf8";
        no surprises
LANG="my_MM.utf8";
        no surprises
LANG="nan_TW.utf8";
        no surprises
LANG="nan_TW.utf8@latin";
        no surprises
LANG="nb_NO.utf8";
        no surprises
LANG="nds_DE.utf8";
        no surprises
LANG="nds_NL.utf8";
        no surprises
LANG="ne_NP.utf8";
        no surprises
LANG="nhn_MX.utf8";
        no surprises
LANG="niu_NU.utf8";
        no surprises
LANG="niu_NZ.utf8";
        no surprises
LANG="nl_AW.utf8";
        no surprises
LANG="nl_BE.utf8";
        no surprises
LANG="nl_NL.utf8";
        no surprises
LANG="nn_NO.utf8";
        no surprises
LANG="nr_ZA.utf8";
        no surprises
LANG="nso_ZA.utf8";
        no surprises
LANG="oc_FR.utf8";
        no surprises
LANG="om_ET.utf8";
        no surprises
LANG="om_KE.utf8";
        no surprises
LANG="or_IN.utf8";
        no surprises
LANG="os_RU.utf8";
        no surprises
LANG="pa_IN.utf8";
        no surprises
LANG="pap_AN.utf8";
        no surprises
LANG="pap_AW.utf8";
        no surprises
LANG="pap_CW.utf8";
        no surprises
LANG="pa_PK.utf8";
        no surprises
LANG="pl_PL.utf8";
        no surprises
LANG="ps_AF.utf8";
        no surprises
LANG="pt_BR.utf8";
        no surprises
LANG="pt_PT.utf8";
        no surprises
LANG="quz_PE.utf8";
        no surprises
LANG="raj_IN.utf8";
        no surprises
LANG="ro_RO.utf8";
        no surprises
LANG="ru_RU.utf8";
        no surprises
LANG="ru_UA.utf8";
        no surprises
LANG="rw_RW.utf8";
        no surprises
LANG="sa_IN.utf8";
        no surprises
LANG="sat_IN.utf8";
        no surprises
LANG="sc_IT.utf8";
        no surprises
LANG="sd_IN.utf8";
        no surprises
LANG="sd_IN.utf8@devanagari";
        no surprises
LANG="se_NO.utf8";
        no surprises
LANG="shs_CA.utf8";
        no surprises
LANG="sid_ET.utf8";
        no surprises
LANG="si_LK.utf8";
        no surprises
LANG="sk_SK.utf8";
        no surprises
LANG="sl_SI.utf8";
        no surprises
LANG="so_DJ.utf8";
        no surprises
LANG="so_ET.utf8";
        no surprises
LANG="so_KE.utf8";
        no surprises
LANG="so_SO.utf8";
        no surprises
LANG="sq_AL.utf8";
        no surprises
LANG="sq_MK.utf8";
        no surprises
LANG="sr_ME.utf8";
        no surprises
LANG="sr_RS.utf8";
        no surprises
LANG="sr_RS.utf8@latin";
        no surprises
LANG="ss_ZA.utf8";
        no surprises
LANG="st_ZA.utf8";
        no surprises
LANG="sv_FI.utf8";
        no surprises
LANG="sv_SE.utf8";
        no surprises
LANG="sw_KE.utf8";
        no surprises
LANG="sw_TZ.utf8";
        no surprises
LANG="szl_PL.utf8";
        no surprises
LANG="ta_IN.utf8";
        no surprises
LANG="ta_LK.utf8";
        no surprises
LANG="te_IN.utf8";
        no surprises
LANG="tg_TJ.utf8";
        no surprises
LANG="the_NP.utf8";
        no surprises
LANG="th_TH.utf8";
        no surprises
LANG="ti_ER.utf8";
        no surprises
LANG="ti_ET.utf8";
        no surprises
LANG="tig_ER.utf8";
        no surprises
LANG="tk_TM.utf8";
        no surprises
LANG="tl_PH.utf8";
        no surprises
LANG="tn_ZA.utf8";
        no surprises
LANG="tr_CY.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="tr_TR.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="ts_ZA.utf8";
        no surprises
LANG="tt_RU.utf8";
        no surprises
LANG="tt_RU.utf8@iqtelif";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="tu_IN.utf8";
        no surprises
LANG="ug_CN.utf8";
        no surprises
LANG="uk_UA.utf8";
        no surprises
LANG="unm_US.utf8";
        no surprises
LANG="ur_IN.utf8";
        no surprises
LANG="ur_PK.utf8";
        no surprises
LANG="uz_UZ.utf8";
        no surprises
LANG="uz_UZ.utf8@cyrillic";
        no surprises
LANG="ve_ZA.utf8";
        no surprises
LANG="vi_VN.utf8";
        no surprises
LANG="wa_BE.utf8";
        no surprises
LANG="wae_CH.utf8";
        no surprises
LANG="wal_ET.utf8";
        no surprises
LANG="wo_SN.utf8";
        no surprises
LANG="xh_ZA.utf8";
        no surprises
LANG="yi_US.utf8";
        no surprises
LANG="yo_NG.utf8";
        no surprises
LANG="yue_HK.utf8";
        no surprises
LANG="zh_CN.utf8";
        no surprises
LANG="zh_HK.utf8";
        no surprises
LANG="zh_SG.utf8";
        no surprises
LANG="zh_TW.utf8";
        no surprises
LANG="zu_ZA.utf8";
        no surprises

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Nov 25, 2015 4:19 PM, "Mikhail T." <mi...@aldan.algebra.com> wrote:
>
>
>>
>> So, the concern is, some hypothetical header, such as X-ASSIGN-TO may,
after going through the locale-aware strtolower() unexpectedly become
x-aßign-to?
>
> I just tested the above on both FreeBSD and Linux, and the results are
encouraging:
>>
>> % echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
>> strasse
>
> Thus, I contend, using C-library will not cause invalid results, and the
only reason to have Apache's own implementation is performance, but not
correctness.

Well almost but wrong...

The pure char-based ß processing produced no case change in my reviews of
tolower/toupper in the de_DE codeset.  String comparison is another matter;
there the collation order changes substantially.

That said, I'm working up a comprehensive audit, and other codeset/language
combinations absolutely do change case.  Code and results forthcoming shortly.

As long as everyone keeps their fingers off the setlocale()/trigger, it's
all fine.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by "Mikhail T." <mi...@aldan.algebra.com>.
On 25.11.2015 18:21, Bert Huijben wrote:
> That Turkish ‘I’ problem is the only case I know of where the
> collation actually changes behavior within the usual western alphabet
> of ASCII characters.
Argh, yes, I see now, what the problem would be... Thank you,

    -mi


RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Bert Huijben <be...@qqmail.nl>.
See http://www.siao2.com/2004/12/03/274288.aspx
and http://www.siao2.com/2013/04/04/10407543.aspx
for some background and related bugs in several products.

I hope this blog will stay alive. (The author passed away recently.)

                Bert

 

From: Bert Huijben [mailto:bert@qqmail.nl] 
Sent: donderdag 26 november 2015 00:22
To: dev@httpd.apache.org
Subject: RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

The example was the other way around. Changing SS to ß is not a valid transform, but the other way is. There are also transforms on the combined AE characters, etc.

 

That Turkish ‘I’ problem is the only case I know of where the collation actually changes behavior within the usual western alphabet of ASCII characters.

 

                Bert

 

 

From: Mikhail T. [mailto:mi+thun@aldan.algebra.com] 
Sent: woensdag 25 november 2015 23:19
To: dev@httpd.apache.org
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

On 25.11.2015 14:10, Mikhail T. wrote:

Two variables, LC_CTYPE and LC_COLLATE control this text processing behavior.  The above is the correct lower case transliteration for Turkish.  In German, the upper case correspondence of sharp-S ß is 'SS', but multi-char translation is not provided by the simple tolower/toupper functions.

So, the concern is, some hypothetical header, such as X-ASSIGN-TO may, after going through the locale-aware strtolower() unexpectedly become x-aßign-to?

I just tested the above on both FreeBSD and Linux, and the results are encouraging:

% echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
strasse

Thus, I contend, using C-library will not cause invalid results, and the only reason to have Apache's own implementation is performance, but not correctness.

-mi


RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Bert Huijben <be...@qqmail.nl>.
The example was the other way around. Changing SS to ß is not a valid transform, but the other way is. There are also transforms on the combined AE characters, etc.

 

That Turkish ‘I’ problem is the only case I know of where the collation actually changes behavior within the usual western alphabet of ASCII characters.

 

                Bert

 

 

From: Mikhail T. [mailto:mi+thun@aldan.algebra.com] 
Sent: woensdag 25 november 2015 23:19
To: dev@httpd.apache.org
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

On 25.11.2015 14:10, Mikhail T. wrote:

Two variables, LC_CTYPE and LC_COLLATE control this text processing behavior.  The above is the correct lower case transliteration for Turkish.  In German, the upper case correspondence of sharp-S ß is 'SS', but multi-char translation is not provided by the simple tolower/toupper functions.

So, the concern is, some hypothetical header, such as X-ASSIGN-TO may, after going through the locale-aware strtolower() unexpectedly become x-aßign-to?

I just tested the above on both FreeBSD and Linux, and the results are encouraging:

% echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
strasse

Thus, I contend, using C-library will not cause invalid results, and the only reason to have Apache's own implementation is performance, but not correctness.

-mi


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by "Mikhail T." <mi...@aldan.algebra.com>.
On 25.11.2015 14:10, Mikhail T. wrote:
>>
>> Two variables, LC_CTYPE and LC_COLLATE control this text processing
>> behavior.  The above is the correct lower case transliteration for
>> Turkish.  In German, the upper case correspondence of sharp-S ß is
>> 'SS', but multi-char translation is not provided by the simple
>> tolower/toupper functions.
>>
> So, the concern is, some hypothetical header, such as X-ASSIGN-TO may,
> after going through the locale-aware strtolower() unexpectedly become
> x-aßign-to?
I just tested the above on both FreeBSD and Linux, and the results are
encouraging:

    % echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
    strasse

Thus, I contend, using C-library will not cause invalid results, and the
only reason to have Apache's own implementation is performance, but not
correctness.

    -mi


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by "Mikhail T." <mi...@aldan.algebra.com>.
On 25.11.2015 13:16, William A Rowe Jr wrote:
>
> Two variables, LC_CTYPE and LC_COLLATE control this text processing
> behavior.  The above is the correct lower case transliteration for
> Turkish.  In German, the upper case correspondence of sharp-S ß is
> 'SS', but multi-char translation is not provided by the simple
> tolower/toupper functions.
>
So, the concern is, some hypothetical header, such as X-ASSIGN-TO may,
after going through the locale-aware strtolower() unexpectedly become
x-aßign-to?

    -mi


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Nov 25, 2015 12:00, "Mikhail T." <mi...@aldan.algebra.com> wrote:
>
> On 25.11.2015 12:42, William A Rowe Jr wrote:
>>
>> If the script switches setlocale to turkish, for example, our
forced-lowercase content-type conversion
>> will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the specs
intended.
>
> I'm sorry, could you elaborate on this? Would not strtolower(3) convert
"IMAGE/GIF" to "image/gif" in all locales -- including "C"? At least, in
all single-byte charsets -- such as the Turkish ISO 8859-9? Yes, the
function will act differently on the strings containing octets above 127,
but those would occur neither in content-types nor in header-names...

Two variables, LC_CTYPE and LC_COLLATE control this text processing
behavior.  The above is the correct lower case transliteration for
Turkish.  In German, the upper case correspondence of sharp-S ß is 'SS',
but multi-char translation is not provided by the simple tolower/toupper
functions.

Consider that this is a function of language, and not of 'charset' per se.
The same charset behaves differently based on the locale's language.
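
To make the language-versus-charset point concrete, here is a minimal C
sketch.  It is not code from APR or httpd, and the helper name
lower_I_in_locale is made up for illustration.  It asks tolower() for the
lower case of 'I' under whichever LC_CTYPE locale is named.  In the "C"
locale the answer is 'i'; with a Turkish ISO 8859-9 locale, the audit
output in this thread shows dotless ı instead.  Locale names and their
availability vary by system, so the helper returns -1 when setlocale()
rejects the name.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: lower-case 'I' under the named LC_CTYPE locale.
 * Returns -1 if setlocale() does not accept the locale name. */
static int lower_I_in_locale(const char *name)
{
    int result;

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;                      /* locale not available here */

    /* tolower() consults the LC_CTYPE category that was just selected. */
    result = tolower((unsigned char)'I');

    setlocale(LC_CTYPE, "C");           /* back to the startup default */
    return result;
}
```

Calling lower_I_in_locale("C") and then lower_I_in_locale("tr_TR.iso88599")
(assuming that locale name exists on the machine) shows the same octet
folding differently depending only on the language selected.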

>> Adding unambiguous token handling functions would be good for the few
case-insensitive string comparison, string folding, and search functions.
It allows the spec-consumer to trust their string processing.
>
> Up until now, I thought, the thread was about coming up with a short-cut
-- an optimization for processing tokens, like request-headers, which are
known to be in US-ASCII anyway and where using locale-aware functions is
simply wasteful -- but not incorrect.

Partially so, that was the motivation behind the proposal.  Apparently OS/X
in particular has a slow implementation of strcasecmp even running under
the Posix locale.

> You seem to imply, the locale-aware functions might be doing the wrong
thing some times -- and this confuses me...

Until the APR consumer, including an instance of httpd, actually calls
setlocale(), everything should be behaving as expected.  If your in-process
code under httpd calls setlocale() to customize its behavior based on the
HTTP consumer's locale, that is when things may go badly under the hood in
both httpd and in APR.

But yes, I flagged this to the security team almost immediately and then
had to research what could introduce such a vulnerability of accepting
unexpected input and treating it as valid ASCII.  I was less concerned with
treating valid ASCII as opaque text which would be rejected.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by "Mikhail T." <mi...@aldan.algebra.com>.
On 25.11.2015 12:42, William A Rowe Jr wrote:
> If the script switches setlocale to turkish, for example, our
> forced-lowercase content-type conversion 
> will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the
> specs intended.
I'm sorry, could you elaborate on this? Would not strtolower(3) convert
"IMAGE/GIF" to "image/gif" in /all/ locales -- including "C"? At least,
in all single-byte charsets -- such as the Turkish ISO 8859-9
<https://en.wikipedia.org/wiki/ISO/IEC_8859-9>? Yes, the function will
act differently on the strings containing octets above 127, but those
would occur neither in content-types nor in header-names...
> Adding unambiguous token handling functions would be good for the few
> case-insensitive string comparison, string folding, and search
> functions.  It allows the spec-consumer to trust their string processing.
Up until now, I thought, the thread was about coming up with a short-cut
-- an optimization for processing tokens, like request-headers, which
are known to be in US-ASCII anyway and where using locale-aware
functions is simply wasteful -- but not incorrect.

You seem to imply, the locale-aware functions might be doing the wrong
thing some times -- and this confuses me...

Yours,

    -mi


Fwd: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
tl;dr - feel free to skip "httpd" specifics to get to the meat of the APR
discussion items.


---------- Forwarded message ----------
From: William A Rowe Jr <wr...@rowe-clan.net>
Date: Wed, Nov 25, 2015 at 11:42 AM
Subject: apr_token_* conclusions (was: Better casecmpstr[n]?)
To: httpd <de...@httpd.apache.org>


On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> What is the current status? Is this on hold?
>

It is looking for a good name.  I'm happy with apr_token_strcasecmp
to best indicate its use-case and provenance.  Does that work for
everyone?

It is looking for clearer docs.  Spent 20 hours just reviewing locale in C
(partly to phrase this discussion more accurately). About to pen those
based on better definitions of terms.

So here are my conclusions as they apply to apr, and to httpd after
all of this locale review...

Background
-----------------

The C spec defines the locale as "C" (e.g. "POSIX") at startup.  Until
somewhere in the code the locale of the whole program is switched with
a call to setlocale(LC_ALL, "") or similar, the application remains in this
non-deterministic (Anglicized) state.  The empty string causes LANG
and the LC_* variables to be evaluated.  Most modern *nix utilities do
this right off the bat in their source.  The compiler should *not* do so
without instruction, so while gcc does the right thing in this respect,
other compilers may not be behaving appropriately.

You might react "how do we handle UTF-8 [or other code page] then?"
The answer is that in the "C" locale, high-bit characters (in ASCII, and
even unusual EBCDIC codes) are effectively opaque.  They *can* be
UTF-8, they might belong to an SBCS (single-byte charset), and they
might be entirely meaningless.  They won't be case folded (but will
keep their unique identity).  Some consumer may recognize them,
others will not.  The C lib functions will treat each octet of an MBCS
as a distinct character, which is a reason that our old-school autoindex
(pre-fancy tables) misaligns the columns when the filename contains
any multibyte characters.  We had simply byte-counted chars.

If our code splits these multibyte sequences, things can go wrong
somewhere down the way.  In particular, treating these multibyte
sequences as distinct characters is fine in UTF-8 where every part
of the multibyte sequence is high-bit-set (and therefore opaque),
but is not fine in ISO2022-JP, where low-bit-set characters may
change their meaning (there are bugs to the effect that we search
a string for pathname separator characters, and these can occur
in multibyte sequences which are valid file names, and not path
separator characters).  There isn't much we can do about this.
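
As a rough illustration of that failure mode, the sketch below does the
naive thing: it counts '/' octets without knowing anything about the
encoding.  The function name count_slash_bytes is hypothetical, and the
ISO-2022-JP sample is hand-built on the assumption that, after the
ESC $ B escape, each character is a pair of 7-bit octets and 0x2F is a
legal second octet.

```c
#include <stddef.h>

/* Hypothetical helper: count octets equal to '/' with no regard for
 * multibyte sequences, the way a naive path scanner would. */
static size_t count_slash_bytes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (*s == '/')
            ++n;
    }
    return n;
}
```

For "a/b" this reports one separator, as expected.  For a UTF-8 name such
as "\xc3\xa9" it reports none, because every octet of the sequence has the
high bit set.  For the ISO-2022-JP octets "\x1b$B\x30\x2f\x1b(B" it also
reports one, although that 0x2F is the second half of a two-byte
character and not a separator at all.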

When using httpd on a *nix filesystem containing UTF-8 chars, we
accept UTF-8 filenames both in client provided fields and within the
httpd.conf configuration.  On a filesystem containing SBCS filenames,
we similarly accept these without translation.  It's up to the admin
to decide their schema; without a third party module, the UTF-8
name isn't recognized in an SBCS directory, and vice versa.

On Windows, all system resources, including filenames, are stored
in Unicode by the OS.  Within APR, we simply treat all system
resource strings as UTF-8, and therefore the conf file in httpd needs
to spell out the UTF-8 name of the resource. All that said, other than
these resource names, even on Windows most string processing
follows the same opaque logic as *nix.

The Mac OS is somewhat similar to Windows, in that all of the
resource names are actually UTF-8 encoded, AFAICT. But unlike
Windows, these opaque strings "just work"; we don't have to do any
Unicode transliteration to get there.

Observations
-------------------

Nowhere in apr or httpd do *we* call setlocale() to change things.  So
the current use of tolower(), toupper(), strcasecmp() etc should not be
subject to dangerous transliterations by *our* doing.

There is nothing that suggests that the APR consumer *cannot*
call setlocale()!  So we need to revisit APR 2.0 and ensure that
our functions perform as-expected even when operating under
some different locale.  This suggests that apr_token_str[n]casecmp,
apr_token_tolower|toupper, and much of the rest of the code that
exists only to evaluate ASCII alpha characters should *not* pay any
attention to the currently defined locale.

In httpd, where things go sideways is if someone is calling setlocale(),
for example in an in-process PHP, Perl or Lua script, because this
changes the core operation of httpd.  If the script switches the locale
to Turkish, for example, our forced-lowercase content-type conversion
will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the
specs intended.
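
For illustration only, a locale-blind fold might look like the sketch
below.  The names ascii_tolower_char and ascii_strtolower are invented
here, and the arithmetic assumes an ASCII execution character set (an
EBCDIC build would need a table, since A-Z is not contiguous there).
Because it never consults the locale, a setlocale() call elsewhere in
the process cannot change its result:

```c
/* Fold a single octet: only ASCII 'A'..'Z' are changed; every other
 * value, including all high-bit octets, is returned untouched.
 * Assumes an ASCII execution character set. */
unsigned char ascii_tolower_char(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Lowercase a NUL-terminated string in place, octet by octet. */
void ascii_strtolower(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)ascii_tolower_char((unsigned char)*s);
}
```

Applied to "IMAGE/GIF" this always yields "image/gif", whatever locale
a script has left the process in.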

I imagine that this has rarely shown up because the few scripts that
might toggle this were correctly written to toggle back the previous
locale upon completion.  But if they are running inline and have
the locale toggled during the filter handoff, the filters themselves
are likely facing some unexpected behavior.

APR conclusions
-------------------------

Adding unambiguous token handling functions would be good for
the few case-insensitive string comparison, string folding, and
search functions.  It allows the spec-consumer to trust their string
processing.  I'm going to suggest apr_token_* as the API prefix.
The API will preserve "POSIX behavior" as defined by A-Z <> a-z
character equivalence and make no compensation for locales
or for MBCS strings.
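
As a rough sketch of what such a comparison could look like (this is
not the committed APR code; the sketch_ names and fold_ascii are
placeholders, and ASCII is assumed), only A-Z and a-z are treated as
equivalent and every other octet compares by its raw value:

```c
#include <stddef.h>

/* Map ASCII 'A'..'Z' onto 'a'..'z'; leave all other octets alone. */
int fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Compare two NUL-terminated strings, ignoring ASCII case only.
 * Returns <0, 0 or >0 in the manner of strcmp(). */
int sketch_token_strcasecmp(const char *s1, const char *s2)
{
    for (;;) {
        const int c1 = fold_ascii((unsigned char)*s1++);
        const int c2 = fold_ascii((unsigned char)*s2++);
        if (c1 != c2)
            return c1 - c2;
        if (c1 == 0)
            return 0;
    }
}

/* Same as above, but examine at most n octets. */
int sketch_token_strncasecmp(const char *s1, const char *s2, size_t n)
{
    while (n-- > 0) {
        const int c1 = fold_ascii((unsigned char)*s1++);
        const int c2 = fold_ascii((unsigned char)*s2++);
        if (c1 != c2)
            return c1 - c2;
        if (c1 == 0)
            break;
    }
    return 0;
}
```

No locale call appears anywhere in the loop, which is the whole point:
the result depends only on the octets passed in.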

httpd conclusions
-------------------------

There are so many edge cases that we simply need to preserve
the code as-is in httpd 2.4 and warn off anyone toggling the locale
that httpd is operating under.  Making the 'offer' to run under many
non-POSIX locales is opening up a can of security vulnerabilities.
It would be irresponsible to half-fix this.

We should perform a thorough review of httpd 2.x trunk so as to
make a statement upon release that the code has been adapted
to correctly operate under most non-POSIX locales.  Warn off
users from third party modules that have not yet made a similar
statement.  Kill all the redundant httpd 2.x == APR functions that
do not belong in trunk (but may have been appropriate in 2.earlier
while waiting for APR to be released).

To the extent that implementations have very poor implementations
of strcmp(), this isn't justification to overload httpd with more band-aids.
Fix the underlying implementation.  So I'm -0.5 on backporting this
change into httpd 2.4 until we see a comprehensive justification, or
until we comprehensively have fixed trunk and are prepared to make
the same "setlocale()-safe" assertion about 2.4.future.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
ascii? ascii? ascii?????

:-)

> On Nov 25, 2015, at 4:52 PM, Christophe JAILLET <ch...@wanadoo.fr> wrote:
> 
> Hi,
> 
> just in case, GNOME has a set of functions g_ascii_...
> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
> 
>> 
>> I'm also waiting for feedback about the naming convention, I'd like to get
>> this into APR yesterday and start building on it, but it's hard to name our
>> generic-posix tolower/toupper until we agree on the naming scheme :)
>> 
>> 
> 


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 3:55 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <
> christophe.jaillet@wanadoo.fr> wrote:
>
>> Hi,
>>
>> just in case, GNOME has a set of functions g_ascii_...
>> (see
>> https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp
>> )
>
>
> Interesting, does anyone know offhand whether these perform the expected
> or the stated behavior under EBCDIC environments?
>
>

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Thu, Jan 21, 2016 at 4:18 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

>
> Based on apr's short-name preference, I had yet to redecorate
> these functions as apr_cstr_* functions, but that I will get to
> tomorrow.  If you see something that doesn't fall into the normal
> string / general purpose criteria, feel free to holler before the first
> commit...
>

And yes, I know it doesn't conform yet to our doxygen, one of the
reasons it was delayed until I could finish cleaning that up.  The
entire exercise is an svn cp of the sources from subversion so that
we preserve revision history into another home within apr.

Re: apr_token_* conclusions

Posted by Stefan Sperling <st...@apache.org>.
On Wed, Jan 27, 2016 at 10:40:06PM -0600, William A Rowe Jr wrote:
> If you are new to the conversation, include/apr_cstr.h has absorbed much of
> the efforts of svn_cstring_* API's into apr_cstr_* functions.

I'm very happy to see our strtol()-wrappers in APR. These wrap the POSIX
functions with strict error checking. I hope this will encourage APR
consumers to routinely check for errors while parsing numbers rather than
trusting input. We did this for SVN and it caught a range of issues from
simple user input problems to detection of integer overflows caused by
repository on-disk corruption.

Note that we do have a special strtol() implementation for performance
critical paths in the repository filesystem code:
^/subversion/trunk/subversion/libsvn_fs_fs/id.c:locale_independent_strtol()

Some parts of the filesystem still use svn_cstring_strtoi64() instead
because they're either not performance critical or require specific
range checks.
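
For readers who have not seen this style of wrapper, here is a minimal
sketch of the idea.  It is not the svn or APR implementation; the
function name strict_strtoll and the parse_status codes are invented
for illustration.  It wraps C99 strtoll() and rejects empty input,
trailing characters, overflow, and values outside the caller's range:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
enum parse_status { PARSE_OK = 0, PARSE_EINVAL, PARSE_ERANGE };

/* Parse str into *n, accepting only a complete, in-range number.
 * *n is written only on success. */
enum parse_status strict_strtoll(long long *n, const char *str,
                                 long long minval, long long maxval,
                                 int base)
{
    char *end;
    long long val;

    errno = 0;                      /* strtoll only sets errno on failure */
    val = strtoll(str, &end, base);

    if (end == str || *end != '\0')
        return PARSE_EINVAL;        /* no digits at all, or trailing junk */
    if (errno == ERANGE || val < minval || val > maxval)
        return PARSE_ERANGE;        /* overflow or outside caller's range */
    if (errno != 0)
        return PARSE_EINVAL;        /* e.g. an unsupported base */

    *n = val;
    return PARSE_OK;
}
```

A caller then has to look at the returned status before touching the
value, which is what catches the "42x" and overflow cases that a bare
atoi() would silently accept.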

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
I'm assuming that the 'new in 1.6' refers to APR 1.6...
In which case, I'm not sure what the Warning for apr_cstr_strtoui64()
refers to, version-wise.

> On Jan 26, 2016, at 3:58 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> 
> Sorry, meant to attach something legible...
> Apache Portable Runtime
> 
> Functions
> C (POSIX locale) string functions
> String routines
> Functions
> apr_array_header_t * 	apr_cstr_split (const char *input, const char *sep_chars, int chop_whitespace, apr_pool_t *pool)
>  
> void 	apr_cstr_split_append (apr_array_header_t *array, const char *input, const char *sep_chars, int chop_whitespace, apr_pool_t *pool)
>  
> int 	apr_cstr_match_glob_list (const char *str, const apr_array_header_t *list)
>  
> int 	apr_cstr_match_list (const char *str, const apr_array_header_t *list)
>  
> char * 	apr_cstr_tokenize (const char *sep, char **str)
>  
> int 	apr_cstr_count_newlines (const char *msg)
>  
> char * 	apr_cstr_join (const apr_array_header_t *strings, const char *separator, apr_pool_t *pool)
>  
> int 	apr_cstr_casecmp (const char *str1, const char *str2)
>  
> apr_status_t 	apr_cstr_strtoi64 (apr_int64_t *n, const char *str, apr_int64_t minval, apr_int64_t maxval, int base)
>  
> apr_status_t 	apr_cstr_atoi64 (apr_int64_t *n, const char *str)
>  
> apr_status_t 	apr_cstr_atoi (int *n, const char *str)
>  
> apr_status_t 	apr_cstr_strtoui64 (apr_uint64_t *n, const char *str, apr_uint64_t minval, apr_uint64_t maxval, int base)
>  
> apr_status_t 	apr_cstr_atoui64 (apr_uint64_t *n, const char *str)
>  
> apr_status_t 	apr_cstr_atoui (unsigned int *n, const char *str)
>  
> const char * 	apr_cstr_skip_prefix (const char *str, const char *prefix)
>  
> Detailed Description
> 
> The apr_cstr_* functions provide traditional C char * string text handling, and notably they treat all text in the C (a.k.a. POSIX) locale using the minimal POSIX character set, represented in either ASCII or a corresponding EBCDIC subset.
> 
> Character values outside of that set are treated as opaque bytes, and all multi-byte character sequences are handled as individual distinct octets.
> 
> Multi-byte character sequences whose octets fall in the ASCII range cause unexpected results, such as in the ISO-2022-JP code page where ASCII octets occur within both shift-state and multibyte sequences.
> 
> In the case of the UTF-8 encoding, all multibyte characters fall outside of the C/POSIX range of characters, so these functions are generally safe to use on UTF-8 strings. The programmer must be aware that each octet may not represent a distinct printable character in such encodings.
> 
> The standard C99/POSIX string functions, rather than apr_cstr, should be used in all cases where the current locale and encoding of the text is significant.
> 
> Function Documentation
> 
> apr_status_t apr_cstr_atoi	(	int * 	n,
> const char * 	str 
> )		
> Parse the C string str into a 32 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoi64().
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_atoi64	(	apr_int64_t * 	n,
> const char * 	str 
> )		
> Parse the C string str into a 64 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoi64().
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_atoui	(	unsigned int * 	n,
> const char * 	str 
> )		
> Parse the C string str into an unsigned 32 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoui64(), including the upper limit of APR_INT64_MAX.
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_atoui64	(	apr_uint64_t * 	n,
> const char * 	str 
> )		
> Parse the C string str into an unsigned 64 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoui64(), including the upper limit of APR_INT64_MAX.
> 
> Since
> New in 1.6.
> int apr_cstr_casecmp	(	const char * 	str1,
> const char * 	str2 
> )		
> Compare two strings str1 and str2, treating case-equivalent unaccented Latin (ASCII subset) letters as equal.
> 
> Returns an integer greater than, equal to, or less than 0, according to whether str1 is considered greater than, equal to, or less than str2.
> 
> Since
> New in 1.6.
> int apr_cstr_count_newlines	(	const char * 	msg	)	
> Return the number of line breaks in msg, allowing any kind of newline termination (CR, LF, CRLF, or LFCR), even inconsistent.
> 
> Since
> New in 1.6.
> char* apr_cstr_join	(	const apr_array_header_t * 	strings,
> const char * 	separator,
> apr_pool_t * 	pool 
> )		
> Return a cstring which is the concatenation of strings (an array of char *) each followed by separator (that is, separator will also end the resulting string). Allocate the result in pool. If strings is empty, then return the empty string.
> 
> Since
> New in 1.6.
> int apr_cstr_match_glob_list	(	const char * 	str,
> const apr_array_header_t * 	list 
> )		
> Return TRUE iff str matches any of the elements of list, a list of zero or more glob patterns.
> 
> int apr_cstr_match_list	(	const char * 	str,
> const apr_array_header_t * 	list 
> )		
> Return TRUE iff str exactly matches any of the elements of list.
> 
> Since
> new in 1.7
> const char* apr_cstr_skip_prefix	(	const char * 	str,
> const char * 	prefix 
> )		
> Skip the common prefix prefix from the C string str, and return a pointer to the next character after the prefix. Return NULL if str does not start with prefix.
> 
> Since
> New in 1.6.
> apr_array_header_t* apr_cstr_split	(	const char * 	input,
> const char * 	sep_chars,
> int 	chop_whitespace,
> apr_pool_t * 	pool 
> )		
> Divide input into substrings, interpreting any char from sep_chars as a token separator.
> 
> Return an array of copies of those substrings (plain const char*), allocating both the array and the copies in pool.
> 
> None of the elements added to the array contain any of the characters in sep_chars, and none of the new elements are empty (thus, it is possible that the returned array will have length zero).
> 
> If chop_whitespace is TRUE, then remove leading and trailing whitespace from the returned strings.
> 
> void apr_cstr_split_append	(	apr_array_header_t * 	array,
> const char * 	input,
> const char * 	sep_chars,
> int 	chop_whitespace,
> apr_pool_t * 	pool 
> )		
> Like apr_cstr_split(), but append to existing array instead of creating a new one. Allocate the copied substrings in pool (i.e., caller decides whether or not to pass array->pool as pool).
> 
> apr_status_t apr_cstr_strtoi64	(	apr_int64_t * 	n,
> const char * 	str,
> apr_int64_t 	minval,
> apr_int64_t 	maxval,
> int 	base 
> )		
> Parse the C string str into a 64 bit number, and return it in *n. Assume that the number is represented in base base. Raise an error if conversion fails (e.g. due to overflow), or if the converted number is smaller than minval or larger than maxval.
> 
> Leading whitespace in str is skipped in a locale-dependent way. After that, the string may contain an optional '+' (positive, default) or '-' (negative) character, followed by an optional '0x' prefix if base is 0 or 16, followed by numeric digits appropriate for the base. If there are any more characters after the numeric digits, an error is returned.
> 
> If base is zero, then a leading '0x' or '0X' prefix means hexadecimal, else a leading '0' means octal (implemented, though not documented, in apr_strtoi64() in APR 0.9.0 through 1.5.0), else use base ten.
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_strtoui64	(	apr_uint64_t * 	n,
> const char * 	str,
> apr_uint64_t 	minval,
> apr_uint64_t 	maxval,
> int 	base 
> )		
> Parse the C string str into an unsigned 64 bit number, and return it in *n. Assume that the number is represented in base base. Raise an error if conversion fails (e.g. due to overflow), or if the converted number is smaller than minval or larger than maxval.
> 
> Leading whitespace in str is skipped in a locale-dependent way. After that, the string may contain an optional '+' (positive, default) or '-' (negative) character, followed by an optional '0x' prefix if base is 0 or 16, followed by numeric digits appropriate for the base. If there are any more characters after the numeric digits, an error is returned.
> 
> If base is zero, then a leading '0x' or '0X' prefix means hexadecimal, else a leading '0' means octal (implemented, though not documented, in apr_strtoi64() in APR 0.9.0 through 1.5.0), else use base ten.
> 
> Warning
> The implementation used since version 1.7 returns an error if the parsed number is greater than APR_INT64_MAX, even if it is not greater than maxval.
> Since
> New in 1.6.
> char* apr_cstr_tokenize	(	const char * 	sep,
> char ** 	str 
> )		
> Get the next token from *str interpreting any char from sep as a token separator. Separators at the beginning of str will be skipped. Returns a pointer to the beginning of the first token in *str or NULL if no token is left. Modifies str such that the next call will return the next token.
> 
> Note
> The content of *str may be modified by this function.
> Since
> New in 1.6.
> 
> On Tue, Jan 26, 2016 at 2:57 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> On Thu, Jan 21, 2016 at 4:18 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> This is as far as I got on my last iteration, electing what appear
> to be 'normal string' handling functions that are part of svn.
> 
> Based on apr's short-name preference, I had yet to redecorate
> these functions as apr_cstr_* functions, but that I will get to
> tomorrow.  If you see something that doesn't fall into the normal
> string / general purpose criteria, feel free to holler before the first
> commit...
> 
> This is what is going in shortly... we don't have an svn_boolean_t
> so those become int's, while svn_error_t * becomes an apr_status_t.
> 
> I'll proceed to commit this full set for scrutiny before digging through
> for the various overlapping functions within apr and even across httpd.
> 
> 
> 
> 


Re: apr_token_* conclusions

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Jan 27, 2016 at 10:11 PM, Branko Čibej <br...@apache.org> wrote:

> >
> > Stating that equivalent-case are treated as equal states that the
> > code points "A"-"Z" are all treated as equal, and "a"-"z" are all
> > treated as equal (and "A" and "a" would be treated as unique
> > of one another) LOL
>
> I guess we're using different meanings of the term 'equivalence group.'
> Doesn't really matter as long as the behaviour is correct. Since we also
> spell 'behaviour' differently, it's not surprising that we're talking
> past each other. Not the same language, y'know; wrong locale. :)
>

No doubt, but I'm certain the Queen and I agree on Jim's and my
interpretation as stated above. The statement was unambiguously about
"case-equivalence" and the letters A and a are in no way case-equivalent,
in any interpretation of English nor in any particular locale. Also, I've
always failed [American] English spelling (I'm equally bad at the Queen's
English), as Mike was kind enough to point out and correct me earlier
today :)

Thankfully there is no s/z ambiguity in the word 'interpretation', LOL!
But we do all seem to be on the same page, irrespective of locale.

I really want to thank everyone at Subversion who put this code together,
it was very well thought-out.  I'd like to figure out how to restore the
pipeline of svn committer contributions to the core APR libraries, as well
as httpd committer contributions, and all the other downstream consumer
developers' efforts.  We have always had a very relaxed committer admission
policy, with a very strict API policy, but it really is not hard to work
with.

If you are new to the conversation, include/apr_cstr.h has absorbed much of
the efforts of svn_cstring_* API's into apr_cstr_* functions.  I'd like to
see more of this and encourage you all to offer up enhancements. I imagine
1.6 is not so far away, and with luck and enough contributions, we offer up
a 1.7, even a 1.8 before we launch the unified apr+apr-util 2.0 sometime
this year.

Cheers,

Bill

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 27.01.2016 20:56, William A Rowe Jr wrote:
> On Wed, Jan 27, 2016 at 6:29 AM, Jim Jagielski <jim@jagunet.com
> <ma...@jagunet.com>> wrote:
>
>
>     > On Jan 27, 2016, at 4:44 AM, Branko Čibej <brane@apache.org
>     <ma...@apache.org>> wrote:
>     >
>     >
>     > Hmph, it's concise, not confusing. Subversion's APIs expect all
>     strings
>     > to be encoded in UTF-8, so the docstring can't just say
>     > "case-insensitive" because that would be extremely misleading in
>     that
>     > context.
>     >
>     > APR makes no promises about the encoding, but mentioning that these
>     > functions are designed to work with the ASCII subset (or EBCDIC
>     > equivalent of same) would be quite important, I think?
>
>     I have no idea how encoding matters at all to the meaning
>     of case sensitivity... unless, somehow, 'A' and 'a' are
>     encoded to the exact same value.
>
>     In pretty much every description of string and character
>     comparison functions I've ever encountered, the terms "case
>     sensitive", "case insensitive" or "ignoring case" have all
>     been used to describe whether or not the function considers
>     the case of the character when doing the comparison. I've
>     never seen one use the phrase 'case-equivalent' which implies
>     the exact opposite of what it actually does.
>
>
> I committed a fix I like but am still open to edits.
>
> Stating that equivalent-case are treated as equal states that the
> code points "A"-"Z" are all treated as equal, and "a"-"z" are all
> treated as equal (and "A" and "a" would be treated as unique
> of one another) LOL

I guess we're using different meanings of the term 'equivalence group.'
Doesn't really matter as long as the behaviour is correct. Since we also
spell 'behaviour' differently, it's not surprising that we're talking
past each other. Not the same language, y'know; wrong locale. :)

-- Brane

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 27.01.2016 13:29, Jim Jagielski wrote:
>> On Jan 27, 2016, at 4:44 AM, Branko Čibej <br...@apache.org> wrote:
>>
>>
>> Hmph, it's concise, not confusing. Subversion's APIs expect all strings
>> to be encoded in UTF-8, so the docstring can't just say
>> "case-insensitive" because that would be extremely misleading in that
>> context.
>>
>> APR makes no promises about the encoding, but mentioning that these
>> functions are designed to work with the ASCII subset (or EBCDIC
>> equivalent of same) would be quite important, I think?
> I have no idea how encoding matters at all to the meaning
> of case sensitivity... unless, somehow, 'A' and 'a' are
> encoded to the exact same value.

The important part is the bit about "unaccented Latin letters". Without
that clarification, "case-insensitive" in UTF-8 implies that the byte
sequence "\xC7\xBC" compares equal to "\xC7\xBD" (i.e., 'Ǽ' == 'ǽ'),
which is clearly not how that function works; and I'm ignoring fun
issues with Unicode normalization forms.

So yes, encoding does matter.

-- Brane


Re: apr_token_* conclusions

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Jan 27, 2016 at 6:29 AM, Jim Jagielski <ji...@jagunet.com> wrote:

>
> > On Jan 27, 2016, at 4:44 AM, Branko Čibej <br...@apache.org> wrote:
> >
> >
> > Hmph, it's concise, not confusing. Subversion's APIs expect all strings
> > to be encoded in UTF-8, so the docstring can't just say
> > "case-insensitive" because that would be extremely misleading in that
> > context.
> >
> > APR makes no promises about the encoding, but mentioning that these
> > functions are designed to work with the ASCII subset (or EBCDIC
> > equivalent of same) would be quite important, I think?
>
> I have no idea how encoding matters at all to the meaning
> of case sensitivity... unless, somehow, 'A' and 'a' are
> encoded to the exact same value.
>
> In pretty much every description of string and character
> comparison functions I've ever encountered, the terms "case
> sensitive", "case insensitive" or "ignoring case" have all
> been used to describe whether or not the function considers
> the case of the character when doing the comparison. I've
> never seen one use the phrase 'case-equivalent' which implies
> the exact opposite of what it actually does.


I committed a fix I like but am still open to edits.

Stating that equivalent-case are treated as equal states that the
code points "A"-"Z" are all treated as equal, and "a"-"z" are all
treated as equal (and "A" and "a" would be treated as unique
of one another) LOL

Re: apr_token_* conclusions

Posted by Jim Jagielski <ji...@jaguNET.com>.
> On Jan 27, 2016, at 4:44 AM, Branko Čibej <br...@apache.org> wrote:
> 
> 
> Hmph, it's concise, not confusing. Subversion's APIs expect all strings
> to be encoded in UTF-8, so the docstring can't just say
> "case-insensitive" because that would be extremely misleading in that
> context.
> 
> APR makes no promises about the encoding, but mentioning that these
> functions are designed to work with the ASCII subset (or EBCDIC
> equivalent of same) would be quite important, I think?

I have no idea how encoding matters at all to the meaning
of case sensitivity... unless, somehow, 'A' and 'a' are
encoded to the exact same value.

In pretty much every description of string and character
comparison functions I've ever encountered, the terms "case
sensitive", "case insensitive" or "ignoring case" have all
been used to describe whether or not the function considers
the case of the character when doing the comparison. I've
never seen one use the phrase 'case-equivalent' which implies
the exact opposite of what it actually does.

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 27.01.2016 01:17, William A Rowe Jr wrote:
> On Tue, Jan 26, 2016 at 5:16 PM, Jim Jagielski <jim@jagunet.com
> <ma...@jagunet.com>> wrote:
>
>
>     > On Jan 26, 2016, at 4:39 PM, William A Rowe Jr
>     <wrowe@rowe-clan.net <ma...@rowe-clan.net>> wrote:
>     >
>     > On Tue, Jan 26, 2016 at 3:12 PM, Jim Jagielski <jim@jagunet.com
>     <ma...@jagunet.com>> wrote:
>     > I'm assuming that the 'new in 1.6' refers to APR 1.6...
>     > In which case, I'm not sure what the Warning for
>     apr_cstr_strtoui64()
>     > refers to, version-wise.
>     >
>     > Good catch, trashing the version reference, but keeping the caution.
>     >
>     >
>     > On Tue, Jan 26, 2016 at 3:15 PM, Jim Jagielski <jim@jagunet.com
>     <ma...@jagunet.com>> wrote:
>     > Also, I see apr_cstr_casecmp() but not a case insensitive
>     version... ??
>     >
>     > casecmp means case-insensitive (c.f. strcasecmp).  There is no case-
>     > sensitive match, at least not yet.  Consider strcmp always just
>     works except
>     > in a string containing a NULL-octet multibyte continuation
>     characters, and
>     > we wouldn't speak any such beast in a C/POSIX locale in the
>     first place :)
>
>     The description sez:
>
>        "Compare two strings atr1 and atr2, treating case-equivalent
>     unaccented Latin (ASCII subset) letters as equal."
>
>     which implies, at least to me, case sensitive. I don't read "case
>     equivalent" as case insensitive... equivalence implies "the same"
>     to me.
>
>
> Agreed it is very confusing - I had wordsmithed it a bit, but please
> feel free 
> to further correct now that it lives on apr trunk (2.0).  In fact,
> anyone feel free
> to dive in...

Hmph, it's concise, not confusing. Subversion's APIs expect all strings
to be encoded in UTF-8, so the docstring can't just say
"case-insensitive" because that would be extremely misleading in that
context.

APR makes no promises about the encoding, but mentioning that these
functions are designed to work with the ASCII subset (or EBCDIC
equivalent of same) would be quite important, I think?

-- Brane


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Yann Ylavic <yl...@gmail.com>.
On Wed, Jan 27, 2016 at 1:17 AM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
>
> Yann, I didn't pick up our research yet to tweak the SVN implementation,
> since we never really tested that.  The signed->unsigned transition is a
> noop,
> so the only question is the fastest way to structure the loop.  Hopefully
> the
> compiler can do a good job for all architectures.  In the svn
> implementation,
> this was through a single character casecmp function/macro that handled
> signed chars, and expanded to an int (which "in theory" is an optimal array
> index, our mileage may vary).
>
> Let's compare performance of this implementation and commit the best
> enhancement.

I'll do some tests with this new implementation.
IIRC, svn_cstring_casecmp() (using svn_ctype_casecmp()) was included
in my latest testing.
It did not perform as well as the current ap_casecmpstr(), but I need
to recheck with these two only now, with the latest implementations.

Stay tuned,
Yann.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Tue, Jan 26, 2016 at 5:16 PM, Jim Jagielski <ji...@jagunet.com> wrote:

>
> > On Jan 26, 2016, at 4:39 PM, William A Rowe Jr <wr...@rowe-clan.net>
> wrote:
> >
> > On Tue, Jan 26, 2016 at 3:12 PM, Jim Jagielski <ji...@jagunet.com> wrote:
> > I'm assuming that the 'new in 1.6' refers to APR 1.6...
> > In which case, I'm not sure what the Warning for apr_cstr_strtoui64()
> > refers to, version-wise.
> >
> > Good catch, trashing the version reference, but keeping the caution.
> >
> >
> > On Tue, Jan 26, 2016 at 3:15 PM, Jim Jagielski <ji...@jagunet.com> wrote:
> > Also, I see apr_cstr_casecmp() but not a case insensitive version... ??
> >
> > casecmp means case-insensitive (c.f. strcasecmp).  There is no case-
> > sensitive match, at least not yet.  Consider strcmp always just works
> except
> > in a string containing a NULL-octet multibyte continuation characters,
> and
> > we wouldn't speak any such beast in a C/POSIX locale in the first place
> :)
>
> The description sez:
>
>    "Compare two strings str1 and str2, treating case-equivalent unaccented
> Latin (ASCII subset) letters as equal."
>
> which implies, at least to me, case sensitive. I don't read "case
> equivalent" as case insensitive... equivalence implies "the same"
> to me.
>

Agreed it is very confusing - I had wordsmithed it a bit, but please feel
free to further correct now that it lives on apr trunk (2.0).  In fact,
anyone feel free to dive in...

Yann, I didn't pick up our research yet to tweak the SVN implementation,
since we never really tested that.  The signed->unsigned transition is a
noop, so the only question is the fastest way to structure the loop.
Hopefully the compiler can do a good job for all architectures.  In the
svn implementation, this was through a single character casecmp
function/macro that handled signed chars, and expanded to an int (which
"in theory" is an optimal array index, our mileage may vary).

Let's compare performance of this implementation and commit the best
enhancement.  I finally have a VC14 environment handy to compare and
contrast the alternatives' performance on Windows as well.
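
For readers comparing loop structures, the table-lookup approach described
above can be sketched roughly as follows.  This is a minimal illustration
and not the svn or APR source: the names fold_table, fold_table_init and
sketch_casecmp are hypothetical, the table is filled at run time only to
keep the example short, and an ASCII execution character set is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the APR or svn source. */
static unsigned char fold_table[256];
static int fold_table_ready = 0;

/* Fill a 256-entry table mapping 'A'..'Z' to 'a'..'z' and every other
 * octet to itself.  Assumes ASCII; a real implementation would more
 * likely ship a static const table (plus an EBCDIC variant) instead of
 * this run-time fill, which is not thread-safe. */
static void fold_table_init(void)
{
    int i;

    for (i = 0; i < 256; i++) {
        fold_table[i] = (unsigned char)i;
    }
    for (i = 'A'; i <= 'Z'; i++) {
        fold_table[i] = (unsigned char)(i - 'A' + 'a');
    }
    fold_table_ready = 1;
}

/* Compare two NUL-terminated strings, folding only ASCII letters.
 * Returns <0, 0 or >0 in the manner of strcasecmp(). */
int sketch_casecmp(const char *s1, const char *s2)
{
    const unsigned char *u1 = (const unsigned char *)s1;
    const unsigned char *u2 = (const unsigned char *)s2;

    assert(s1 != NULL && s2 != NULL);
    if (!fold_table_ready) {
        fold_table_init();
    }
    for (;;) {
        /* unsigned char promotes to a non-negative int array index */
        const int c1 = fold_table[*u1++];
        const int c2 = fold_table[*u2++];

        if (c1 != c2 || c1 == 0) {
            return c1 - c2;
        }
    }
}
```

Reading each octet through an unsigned char pointer is what makes the
signed->unsigned step free.  High-bit octets map to themselves in the
table, so they are compared by raw value and never folded.
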

Bill

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
Also, I see apr_cstr_casecmp() but not a case insensitive version... ??

> On Jan 26, 2016, at 3:58 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> 
> Sorry, meant to attach something legible...
> Apache Portable Runtime
> 
> Functions
> C (POSIX locale) string functions
> String routines
> Functions
> apr_array_header_t * 	apr_cstr_split (const char *input, const char *sep_chars, int chop_whitespace, apr_pool_t *pool)
>  
> void 	apr_cstr_split_append (apr_array_header_t *array, const char *input, const char *sep_chars, int chop_whitespace, apr_pool_t *pool)
>  
> int 	apr_cstr_match_glob_list (const char *str, const apr_array_header_t *list)
>  
> int 	apr_cstr_match_list (const char *str, const apr_array_header_t *list)
>  
> char * 	apr_cstr_tokenize (const char *sep, char **str)
>  
> int 	apr_cstr_count_newlines (const char *msg)
>  
> char * 	apr_cstr_join (const apr_array_header_t *strings, const char *separator, apr_pool_t *pool)
>  
> int 	apr_cstr_casecmp (const char *str1, const char *str2)
>  
> apr_status_t 	apr_cstr_strtoi64 (apr_int64_t *n, const char *str, apr_int64_t minval, apr_int64_t maxval, int base)
>  
> apr_status_t 	apr_cstr_atoi64 (apr_int64_t *n, const char *str)
>  
> apr_status_t 	apr_cstr_atoi (int *n, const char *str)
>  
> apr_status_t 	apr_cstr_strtoui64 (apr_uint64_t *n, const char *str, apr_uint64_t minval, apr_uint64_t maxval, int base)
>  
> apr_status_t 	apr_cstr_atoui64 (apr_uint64_t *n, const char *str)
>  
> apr_status_t 	apr_cstr_atoui (unsigned int *n, const char *str)
>  
> const char * 	apr_cstr_skip_prefix (const char *str, const char *prefix)
>  
> Detailed Description
> 
> The apr_cstr_* functions provide traditional C char * string text handling, and notably they treat all text in the C (a.k.a. POSIX) locale using the minimal POSIX character set, represented in either ASCII or a corresponding EBCDIC subset.
> 
> Character values outside of that set are treated as opaque bytes, and all multi-byte character sequences are handled as individual distinct octets.
> 
> Multi-byte character sequences whose octets fall in the ASCII range cause unexpected results, such as in the ISO-2022-JP code page where ASCII octets occur within both shift-state and multibyte sequences.
> 
> In the case of the UTF-8 encoding, all multibyte characters fall outside of the C/POSIX range of characters, so these functions are generally safe to use on UTF-8 strings. The programmer must be aware that each octet may not represent a distinct printable character in such encodings.
> 
> The standard C99/POSIX string functions, rather than apr_cstr, should be used in all cases where the current locale and encoding of the text is significant.
> 
> Function Documentation
> 
> apr_status_t apr_cstr_atoi	(	int * 	n,
> const char * 	str 
> )		
> Parse the C string str into a 32 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoi64().
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_atoi64	(	apr_int64_t * 	n,
> const char * 	str 
> )		
> Parse the C string str into a 64 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoi64().
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_atoui	(	unsigned int * 	n,
> const char * 	str 
> )		
> Parse the C string str into an unsigned 32 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoui64(), including the upper limit of APR_INT64_MAX.
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_atoui64	(	apr_uint64_t * 	n,
> const char * 	str 
> )		
> Parse the C string str into an unsigned 64 bit number, and return it in *n. Assume that the number is represented in base 10. Raise an error if conversion fails (e.g. due to overflow).
> 
> The behaviour otherwise is as described for apr_cstr_strtoui64(), including the upper limit of APR_INT64_MAX.
> 
> Since
> New in 1.6.
> int apr_cstr_casecmp	(	const char * 	str1,
> const char * 	str2 
> )		
> Compare two strings str1 and str2, treating case-equivalent unaccented Latin (ASCII subset) letters as equal.
> 
> Returns an integer greater than, equal to, or less than 0, according to whether str1 is considered greater than, equal to, or less than str2.
> 
> Since
> New in 1.6.
> int apr_cstr_count_newlines	(	const char * 	msg	)	
> Return the number of line breaks in msg, allowing any kind of newline termination (CR, LF, CRLF, or LFCR), even inconsistent.
> 
> Since
> New in 1.6.
> char* apr_cstr_join	(	const apr_array_header_t * 	strings,
> const char * 	separator,
> apr_pool_t * 	pool 
> )		
> Return a cstring which is the concatenation of strings (an array of char *) each followed by separator (that is, separator will also end the resulting string). Allocate the result in pool. If strings is empty, then return the empty string.
> 
> Since
> New in 1.6.
> int apr_cstr_match_glob_list	(	const char * 	str,
> const apr_array_header_t * 	list 
> )		
> Return TRUE iff str matches any of the elements of list, a list of zero or more glob patterns.
> 
> int apr_cstr_match_list	(	const char * 	str,
> const apr_array_header_t * 	list 
> )		
> Return TRUE iff str exactly matches any of the elements of list.
> 
> Since
> new in 1.7
> const char* apr_cstr_skip_prefix	(	const char * 	str,
> const char * 	prefix 
> )		
> Skip the common prefix prefix from the C string str, and return a pointer to the next character after the prefix. Return NULL if str does not start with prefix.
> 
> Since
> New in 1.6.
> apr_array_header_t* apr_cstr_split	(	const char * 	input,
> const char * 	sep_chars,
> int 	chop_whitespace,
> apr_pool_t * 	pool 
> )		
> Divide input into substrings, interpreting any char from sep_chars as a token separator.
> 
> Return an array of copies of those substrings (plain const char*), allocating both the array and the copies in pool.
> 
> None of the elements added to the array contain any of the characters in sep_chars, and none of the new elements are empty (thus, it is possible that the returned array will have length zero).
> 
> If chop_whitespace is TRUE, then remove leading and trailing whitespace from the returned strings.
> 
> void apr_cstr_split_append	(	apr_array_header_t * 	array,
> const char * 	input,
> const char * 	sep_chars,
> int 	chop_whitespace,
> apr_pool_t * 	pool 
> )		
> Like apr_cstr_split(), but append to existing array instead of creating a new one. Allocate the copied substrings in pool (i.e., caller decides whether or not to pass array->pool as pool).
> 
> apr_status_t apr_cstr_strtoi64	(	apr_int64_t * 	n,
> const char * 	str,
> apr_int64_t 	minval,
> apr_int64_t 	maxval,
> int 	base 
> )		
> Parse the C string str into a 64 bit number, and return it in *n. Assume that the number is represented in base base. Raise an error if conversion fails (e.g. due to overflow), or if the converted number is smaller than minval or larger than maxval.
> 
> Leading whitespace in str is skipped in a locale-dependent way. After that, the string may contain an optional '+' (positive, default) or '-' (negative) character, followed by an optional '0x' prefix if base is 0 or 16, followed by numeric digits appropriate for the base. If there are any more characters after the numeric digits, an error is returned.
> 
> If base is zero, then a leading '0x' or '0X' prefix means hexadecimal, else a leading '0' means octal (implemented, though not documented, in apr_strtoi64() in APR 0.9.0 through 1.5.0), else use base ten.
> 
> Since
> New in 1.6.
> apr_status_t apr_cstr_strtoui64	(	apr_uint64_t * 	n,
> const char * 	str,
> apr_uint64_t 	minval,
> apr_uint64_t 	maxval,
> int 	base 
> )		
> Parse the C string str into an unsigned 64 bit number, and return it in *n. Assume that the number is represented in base base. Raise an error if conversion fails (e.g. due to overflow), or if the converted number is smaller than minval or larger than maxval.
> 
> Leading whitespace in str is skipped in a locale-dependent way. After that, the string may contain an optional '+' (positive, default) or '-' (negative) character, followed by an optional '0x' prefix if base is 0 or 16, followed by numeric digits appropriate for the base. If there are any more characters after the numeric digits, an error is returned.
> 
> If base is zero, then a leading '0x' or '0X' prefix means hexadecimal, else a leading '0' means octal (implemented, though not documented, in apr_strtoi64() in APR 0.9.0 through 1.5.0), else use base ten.
> 
> Warning
> The implementation used since version 1.7 returns an error if the parsed number is greater than APR_INT64_MAX, even if it is not greater than maxval.
> Since
> New in 1.6.
> char* apr_cstr_tokenize	(	const char * 	sep,
> char ** 	str 
> )		
> Get the next token from *str interpreting any char from sep as a token separator. Separators at the beginning of str will be skipped. Returns a pointer to the beginning of the first token in *str or NULL if no token is left. Modifies str such that the next call will return the next token.
> 
> Note
> The content of *str may be modified by this function.
> Since
> New in 1.6.
> 
> On Tue, Jan 26, 2016 at 2:57 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> On Thu, Jan 21, 2016 at 4:18 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> This is as far as I got on my last iteration, electing what appear
> to be 'normal string' handling functions that are part of svn.
> 
> Based on apr's short-name preference, I had yet to redecorate
> these functions as apr_cstr_* functions, but that I will get to
> tomorrow.  If you see something that doesn't fall into the normal
> string / general purpose criteria, feel free to holler before the first
> commit...
> 
> This is what is going in shortly... we don't have an svn_boolean_t
> so those become int's, while svn_error_t * becomes an apr_status_t.
> 
> I'll proceed to commit this full set for scrutiny before digging through
> for the various overlapping functions within apr and even across httpd.
> 
> 
> 
> 


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
Sorry, meant to attach something legible...
Apache Portable Runtime

Functions
C (POSIX locale) string functions
String routines
Functions

apr_array_header_t *apr_cstr_split(const char *input, const char *sep_chars,
    int chop_whitespace, apr_pool_t *pool)

void apr_cstr_split_append(apr_array_header_t *array, const char *input,
    const char *sep_chars, int chop_whitespace, apr_pool_t *pool)

int apr_cstr_match_glob_list(const char *str, const apr_array_header_t *list)

int apr_cstr_match_list(const char *str, const apr_array_header_t *list)

char *apr_cstr_tokenize(const char *sep, char **str)

int apr_cstr_count_newlines(const char *msg)

char *apr_cstr_join(const apr_array_header_t *strings,
    const char *separator, apr_pool_t *pool)

int apr_cstr_casecmp(const char *str1, const char *str2)

apr_status_t apr_cstr_strtoi64(apr_int64_t *n, const char *str,
    apr_int64_t minval, apr_int64_t maxval, int base)

apr_status_t apr_cstr_atoi64(apr_int64_t *n, const char *str)

apr_status_t apr_cstr_atoi(int *n, const char *str)

apr_status_t apr_cstr_strtoui64(apr_uint64_t *n, const char *str,
    apr_uint64_t minval, apr_uint64_t maxval, int base)

apr_status_t apr_cstr_atoui64(apr_uint64_t *n, const char *str)

apr_status_t apr_cstr_atoui(unsigned int *n, const char *str)

const char *apr_cstr_skip_prefix(const char *str, const char *prefix)

Detailed Description

The apr_cstr_* functions provide traditional C char * string text handling,
and notably they treat all text in the C (a.k.a. POSIX) locale using the
minimal POSIX character set, represented in either ASCII or a corresponding
EBCDIC subset.

Character values outside of that set are treated as opaque bytes, and all
multi-byte character sequences are handled as individual distinct octets.

Multi-byte character sequences whose octets fall in the ASCII range cause
unexpected results, such as in the ISO-2022-JP code page where ASCII octets
occur within both shift-state and multibyte sequences.

In the case of the UTF-8 encoding, all multibyte characters fall outside of
the C/POSIX range of characters, so these functions are generally safe to
use on UTF-8 strings. The programmer must be aware that each octet may not
represent a distinct printable character in such encodings.

The standard C99/POSIX string functions, rather than apr_cstr, should be
used in all cases where the current locale and encoding of the text is
significant.
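
To make the "opaque bytes" point concrete, here is a small sketch of an
ASCII-only lowercase pass over a UTF-8 string.  The helper name
sketch_ascii_tolower is hypothetical (it is not part of the apr_cstr API)
and the code assumes an ASCII execution character set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lowercase only 'A'..'Z' in place; every other octet, including each
 * octet of a UTF-8 multibyte sequence, is left untouched.
 * Hypothetical helper, ASCII assumed. */
void sketch_ascii_tolower(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s >= 'A' && *s <= 'Z') {
            *s = (char)(*s - 'A' + 'a');
        }
    }
}
```

Applied to the octets C3 89 followed by "COLE" (an accented capital E in
UTF-8, then four ASCII letters), only the ASCII tail changes to "cole".
The two leading octets keep their identity and are not case folded, and
strlen() reports six octets for what a reader sees as five characters.
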
Function Documentation

apr_status_t apr_cstr_atoi(int *n, const char *str)

Parse the C string str into a 32 bit number, and return it in *n.
Assume that the number is represented in base 10. Raise an error if
conversion fails (e.g. due to overflow).

The behaviour otherwise is as described for apr_cstr_strtoi64().

Since: New in 1.6.

apr_status_t apr_cstr_atoi64(apr_int64_t *n, const char *str)

Parse the C string str into a 64 bit number, and return it in *n.
Assume that the number is represented in base 10. Raise an error if
conversion fails (e.g. due to overflow).

The behaviour otherwise is as described for apr_cstr_strtoi64().

Since: New in 1.6.

apr_status_t apr_cstr_atoui(unsigned int *n, const char *str)

Parse the C string str into an unsigned 32 bit number, and return it in
*n. Assume that the number is represented in base 10. Raise an error if
conversion fails (e.g. due to overflow).

The behaviour otherwise is as described for apr_cstr_strtoui64(),
including the upper limit of APR_INT64_MAX.

Since: New in 1.6.

apr_status_t apr_cstr_atoui64(apr_uint64_t *n, const char *str)

Parse the C string str into an unsigned 64 bit number, and return it in
*n. Assume that the number is represented in base 10. Raise an error if
conversion fails (e.g. due to overflow).

The behaviour otherwise is as described for apr_cstr_strtoui64(),
including the upper limit of APR_INT64_MAX.

Since: New in 1.6.

int apr_cstr_casecmp(const char *str1, const char *str2)

Compare two strings str1 and str2, treating case-equivalent unaccented
Latin (ASCII subset) letters as equal.

Returns an integer greater than, equal to, or less than 0, according to
whether str1 is considered greater than, equal to, or less than str2.

Since: New in 1.6.

int apr_cstr_count_newlines(const char *msg)

Return the number of line breaks in msg, allowing any kind of newline
termination (CR, LF, CRLF, or LFCR), even inconsistent.

Since: New in 1.6.
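
The counting rule described here (a two-octet CRLF or LFCR pair counts as
one break) can be sketched as below.  This is an illustration under the
stated rule only; sketch_count_newlines is a hypothetical name and the
code is not taken from the APR tree.

```c
#include <assert.h>
#include <stddef.h>

/* Count line breaks, treating CR, LF, CRLF and LFCR each as a single
 * break.  Hypothetical sketch, not the APR implementation. */
int sketch_count_newlines(const char *msg)
{
    int count = 0;
    const char *p;

    assert(msg != NULL);
    for (p = msg; *p != '\0'; p++) {
        if (*p == '\n') {
            count++;
            if (p[1] == '\r') {
                p++;            /* LFCR: consume the paired CR */
            }
        }
        else if (*p == '\r') {
            count++;
            if (p[1] == '\n') {
                p++;            /* CRLF: consume the paired LF */
            }
        }
    }
    return count;
}
```

Because pairs are consumed greedily from the left, a sequence such as
LF CR LF counts as two breaks (LFCR, then LF).
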

char *apr_cstr_join(const apr_array_header_t *strings,
    const char *separator, apr_pool_t *pool)

Return a cstring which is the concatenation of strings (an array of
char *) each followed by separator (that is, separator will also end the
resulting string). Allocate the result in pool. If strings is empty,
then return the empty string.

Since: New in 1.6.
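
The trailing-separator behaviour is easy to miss, so here is a rough
sketch of it.  To stay self-contained it takes a plain array plus a count
and allocates with malloc() instead of an apr_array_header_t and a pool;
those substitutions and the name sketch_join are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate count strings, each followed by separator.  Returns a
 * malloc'ed string the caller must free(), or NULL if allocation fails.
 * Hypothetical stand-in: the real function uses an APR array and pool. */
char *sketch_join(const char *const *strings, size_t count,
                  const char *separator)
{
    size_t sep_len, total = 0, i;
    char *result, *p;

    assert(separator != NULL);
    assert(count == 0 || strings != NULL);

    sep_len = strlen(separator);
    for (i = 0; i < count; i++) {
        total += strlen(strings[i]) + sep_len;
    }
    result = malloc(total + 1);
    if (result == NULL) {
        return NULL;
    }
    p = result;
    for (i = 0; i < count; i++) {
        size_t len = strlen(strings[i]);

        memcpy(p, strings[i], len);
        p += len;
        memcpy(p, separator, sep_len);   /* also after the last element */
        p += sep_len;
    }
    *p = '\0';
    return result;
}
```

Joining "a", "b" and "c" with ", " therefore yields "a, b, c, " with the
separator at the end, and an empty input yields the empty string.
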

int apr_cstr_match_glob_list(const char *str, const apr_array_header_t *list)

Return TRUE iff str matches any of the elements of list, a list of zero
or more glob patterns.

int apr_cstr_match_list(const char *str, const apr_array_header_t *list)

Return TRUE iff str exactly matches any of the elements of list.

Since: new in 1.7

const char *apr_cstr_skip_prefix(const char *str, const char *prefix)

Skip the common prefix prefix from the C string str, and return a
pointer to the next character after the prefix. Return NULL if str does
not start with prefix.

Since: New in 1.6.
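
A minimal sketch of this contract, using only the standard library; the
name sketch_skip_prefix is hypothetical and the body is an illustration
of the documented behaviour, not the APR source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer just past prefix if str starts with it, else NULL.
 * The comparison is exact (case-sensitive, octet by octet). */
const char *sketch_skip_prefix(const char *str, const char *prefix)
{
    size_t len;

    assert(str != NULL && prefix != NULL);
    len = strlen(prefix);
    if (strncmp(str, prefix, len) == 0) {
        return str + len;
    }
    return NULL;
}
```

With str "Content-Type" and prefix "Content-" the result points at
"Type"; with prefix "content-" the result is NULL, since no case folding
is applied here.
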

apr_array_header_t *apr_cstr_split(const char *input, const char *sep_chars,
    int chop_whitespace, apr_pool_t *pool)

Divide input into substrings, interpreting any char from sep_chars as a
token separator.

Return an array of copies of those substrings (plain const char*),
allocating both the array and the copies in pool.

None of the elements added to the array contain any of the characters in
sep_chars, and none of the new elements are empty (thus, it is possible
that the returned array will have length zero).

If chop_whitespace is TRUE, then remove leading and trailing whitespace
from the returned strings.

void apr_cstr_split_append(apr_array_header_t *array, const char *input,
    const char *sep_chars, int chop_whitespace, apr_pool_t *pool)

Like apr_cstr_split(), but append to existing array instead of creating a
new one. Allocate the copied substrings in pool (i.e., caller decides
whether or not to pass array->pool as pool).

apr_status_t apr_cstr_strtoi64(apr_int64_t *n, const char *str,
    apr_int64_t minval, apr_int64_t maxval, int base)

Parse the C string str into a 64 bit number, and return it in *n.
Assume that the number is represented in base base. Raise an error if
conversion fails (e.g. due to overflow), or if the converted number is
smaller than minval or larger than maxval.

Leading whitespace in str is skipped in a locale-dependent way. After
that, the string may contain an optional '+' (positive, default) or '-'
(negative) character, followed by an optional '0x' prefix if base is 0 or
16, followed by numeric digits appropriate for the base. If there are any
more characters after the numeric digits, an error is returned.

If base is zero, then a leading '0x' or '0X' prefix means hexadecimal,
else a leading '0' means octal (implemented, though not documented, in
apr_strtoi64() in APR 0.9.0 through 1.5.0), else use base ten.

Since: New in 1.6.
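
The checks listed above (no trailing characters, overflow, and the
caller's minval/maxval window) can be sketched on top of the standard
strtoll().  Everything here is an assumption for illustration: the name
sketch_strtoi64, the use of long long in place of apr_int64_t, and the
errno-style return codes in place of apr_status_t values.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse str in the given base into *n.  Returns 0 on success, EINVAL
 * for malformed input, ERANGE for overflow or a value outside
 * [minval, maxval].  Hypothetical sketch built on strtoll(). */
int sketch_strtoi64(long long *n, const char *str,
                    long long minval, long long maxval, int base)
{
    char *end;
    long long val;

    assert(n != NULL && str != NULL);

    errno = 0;
    val = strtoll(str, &end, base);

    /* No digits consumed, bad base, or trailing characters. */
    if (errno == EINVAL || end == str || *end != '\0') {
        return EINVAL;
    }
    /* Overflow in strtoll() itself, or outside the caller's window. */
    if (errno == ERANGE || val < minval || val > maxval) {
        return ERANGE;
    }
    *n = val;
    return 0;
}
```

On failure *n is left untouched in this sketch, so a caller can keep a
default value in place.  Whether the real function makes the same promise
is not stated in the documentation above.
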

apr_status_t apr_cstr_strtoui64(apr_uint64_t *n, const char *str,
    apr_uint64_t minval, apr_uint64_t maxval, int base)

Parse the C string str into an unsigned 64 bit number, and return it in
*n. Assume that the number is represented in base base. Raise an error
if conversion fails (e.g. due to overflow), or if the converted number is
smaller than minval or larger than maxval.

Leading whitespace in str is skipped in a locale-dependent way. After
that, the string may contain an optional '+' (positive, default) or '-'
(negative) character, followed by an optional '0x' prefix if base is 0 or
16, followed by numeric digits appropriate for the base. If there are any
more characters after the numeric digits, an error is returned.

If base is zero, then a leading '0x' or '0X' prefix means hexadecimal,
else a leading '0' means octal (implemented, though not documented, in
apr_strtoi64() in APR 0.9.0 through 1.5.0), else use base ten.

Warning: The implementation used since version 1.7 returns an error if the
parsed number is greater than APR_INT64_MAX, even if it is not greater than
maxval.

Since: New in 1.6.

char *apr_cstr_tokenize(const char *sep, char **str)

Get the next token from *str interpreting any char from sep as a token
separator. Separators at the beginning of str will be skipped. Returns a
pointer to the beginning of the first token in *str or NULL if no token
is left. Modifies str such that the next call will return the next token.

Note: The content of *str may be modified by this function.

Since: New in 1.6.
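
The calling convention (a cursor the function advances, with the buffer
modified in place) is sketched below using strspn() and strcspn().  The
name sketch_tokenize is hypothetical and the body is only one plausible
way to satisfy the description, not the APR implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the next token from *str, or NULL when none is left.  Writes a
 * NUL over the separator that ends the token and advances *str past it.
 * Hypothetical sketch of the documented behaviour. */
char *sketch_tokenize(const char *sep, char **str)
{
    char *token;
    char *end;

    if (sep == NULL || str == NULL || *str == NULL) {
        return NULL;
    }
    token = *str + strspn(*str, sep);    /* skip leading separators */
    if (*token == '\0') {
        *str = token;
        return NULL;                     /* nothing but separators left */
    }
    end = token + strcspn(token, sep);   /* first separator after token */
    if (*end != '\0') {
        *end++ = '\0';                   /* terminate token in place */
    }
    *str = end;
    return token;
}
```

Because the buffer is written to, callers need to pass writable storage
(a char array or a pool copy) rather than a string literal.
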

On Tue, Jan 26, 2016 at 2:57 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Thu, Jan 21, 2016 at 4:18 PM, William A Rowe Jr <wr...@rowe-clan.net>
> wrote:
>
>> This is as far as I got on my last iteration, electing what appear
>> to be 'normal string' handling functions that are part of svn.
>>
>> Based on apr's short-name preference, I had yet to redecorate
>> these functions as apr_cstr_* functions, but that I will get to
>> tomorrow.  If you see something that doesn't fall into the normal
>> string / general purpose criteria, feel free to holler before the first
>> commit...
>>
>
> This is what is going in shortly... we don't have an svn_boolean_t
> so those become int's, while svn_error_t * becomes an apr_status_t.
>
> I'll proceed to commit this full set for scrutiny before digging through
> for the various overlapping functions within apr and even across httpd.
>
>
>
>

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Thu, Jan 21, 2016 at 4:18 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> This is as far as I got on my last iteration, electing what appear
> to be 'normal string' handling functions that are part of svn.
>
> Based on apr's short-name preference, I had yet to redecorate
> these functions as apr_cstr_* functions, but that I will get to
> tomorrow.  If you see something that doesn't fall into the normal
> string / general purpose criteria, feel free to holler before the first
> commit...
>

This is what is going in shortly... we don't have an svn_boolean_t
so those become int's, while svn_error_t * becomes an apr_status_t.

I'll proceed to commit this full set for scrutiny before digging through
for the various overlapping functions within apr and even across httpd.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
This is as far as I got on my last iteration, electing what appear
to be 'normal string' handling functions that are part of svn.

Based on apr's short-name preference, I had yet to redecorate
these functions as apr_cstr_* functions, but that I will get to
tomorrow.  If you see something that doesn't fall into the normal
string / general purpose criteria, feel free to holler before the first
commit...

Bill

On Thu, Jan 21, 2016 at 4:07 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> No time to respond until pressing $dayjob stuff is finished this evening,
> but I have the entire day tomorrow to devote to bringing the proposed
> change to trunk/ and proposing for backport to branches/1.6/
>
> On Thu, Jan 21, 2016 at 10:47 AM, Jim Jagielski <ji...@jagunet.com> wrote:
>
>> Any updates on this??
>>
>
>

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
No time to respond until pressing $dayjob stuff is finished this evening,
but I have the entire day tomorrow to devote to bringing the proposed
change to trunk/ and proposing for backport to branches/1.6/

On Thu, Jan 21, 2016 at 10:47 AM, Jim Jagielski <ji...@jagunet.com> wrote:

> Any updates on this??
>

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
Any updates on this??

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
>
> On Nov 26, 2015 08:39, "William A Rowe Jr" <wr...@rowe-clan.net> wrote:
>
>> Sounds right... Actually a fusion between svn_cstring_* and several
>> existing ap_ and apr_ functions would be useful.
>>
>> SVN folk, any objection to APR appropriating these API's?  20/20
>> hindsight, is apr_cstring_ or shorter apr_cstr_ the way to go here?  You
>> all had to use the thing so I trust your preferences.  Either expresses
>> locale C in my mind, so they work for me.
>> On Nov 26, 2015 07:38, "Jim Jagielski" <ji...@jagunet.com> wrote:
>>
>>> Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
>>> pretty much inline with my thoughts on how httpd would
>>> use ours...
>>>
>>> > On Nov 25, 2015, at 5:10 PM, Bert Huijben <be...@qqmail.nl> wrote:
>>> >
>>> > We have a set of similar comparison functions in Subversion. I’m
>>> > pretty sure we already had these in the time we still had ebcdic
>>> > support on trunk.
>>> > (We removed that support years ago, but the code should still live
>>> > on a branch)
>>>
>>
Given our long preference for short-ish names (apr_sockaddr, apr_finfo_
etc...), and the fact that apr_cstr_ can be read as 'C String' functions
(which could reflect the programming language and also its C/POSIX locale),
I have time this week to add the svn work and start tracking down
candidates/equivalence between that apr_cstr_ API and httpd's
functions/use case.  Yann replied that apr_cstr_ is acceptable to him; I
didn't hear any other feedback except Greg's clarification of what
svn_cstring_ means.

If there are any last minute reservations about that naming convention,
please holler.

Re: apr_token_* conclusions

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
The only question in my mind, after thinking about this all day, is how do
we (plural) de-escalate this immature behaviour between senior ASF
members?  If there was a time to fall on your own katana James, that most
recent post was it.

Let's cut the s*&t and just code some cool stuff?  If you are on that page,
quit apologizing.  If you are out to score some cool public posts, it might
be time to hang up that hat.  And it isn't specific to Jim, anyone who
complains that APR moves too slow wasn't hanging around here 12 yrs ago,
the moment it is time for a release, there will be the momentum for release.

Grow up everyone, this isn't httpd.  Let's code.
On Nov 30, 2015 06:15, "Jim Jagielski" <ji...@jagunet.com> wrote:

>
> > On Nov 27, 2015, at 2:15 PM, Branko Čibej <br...@apache.org> wrote:
> >
> > On 27.11.2015 15:59, Jim Jagielski wrote:
> >>> On Nov 26, 2015, at 8:49 PM, Branko Čibej <br...@apache.org> wrote:
> >>>
> >>> In any case — I don't think anyone over at dev@s.a.o would object to
> >>> APR including those functions. We actually have a number of other, heh,
> >>> improvements on APR that we could "donate"; we just never really got
> >>> around to producing the necessary patches.
> >> Yeah, svn is in the same situation as httpd. There are
> >> some functions which "ideally" would exist in APR,
> >> but APR doesn't move "fast enough" to allow that to
> >> happen, so both projects start collecting APR-like
> >> kruft after a while...
> >>
> >> It certainly would be nice if there was some way to address
> >> that...
> >
> > Uh, what I wrote is in no way intended to be a criticism of APR. Maybe
> > if people who think APR isn't moving fast enough spent their time
> > writing code here instead of writing mails about it, this "problem"
> > would just vanish. At least, that's my understanding of how open source
> > is supposed to work -- right, Jim? ;)
>
> Gosh! You are right! Gee whiz, I haven't had any substantial code added
> to APR since 1.5.1. Thanks for reminding me! I really feel completely
> and utterly unworthy to comment on or criticize APR in any meaningful
> way and I offer my heartfelt apologies to everyone on this thread
> for wasting their time on a thread which started off as a suggestion
> for a new function to be added to httpd but, I noted:
>
>     I propose a ap_strncasecmp/ap_strcasecmp which we should use.
>     Ideally, it would be in apr but no need to wait for that
>     to happen :)
>
> which, at least how I read it, implies code to be added to APR
> in the ideal case. But that is beside the point, as Brane so
> correctly says! Instead of writing emails about code to be
> added, we should instead be writing the code itself, which,
> of course, will be accepted in as-is with no discussion whatsoever,
> since, heck, that's kind of what's going on here, but as Brane
> reminds us all, such a thing is a problem that will magically
> disappear the more we write code!
>
> I propose that no message be allowed on the dev@apr list unless
> a 15-line patch or code contribution is attached. This will
> solve the nasty problem! In fact, maybe we should just shut down
> dev@apr since it encourages such unconstructive behavior as "writing
> emails" when we instead should be heads down cranking out code
> that may, or may not (usually not) be added and used before the
> end of the decade.

Re: apr_token_* conclusions

Posted by Jim Jagielski <ji...@jaguNET.com>.
> On Nov 27, 2015, at 2:15 PM, Branko Čibej <br...@apache.org> wrote:
> 
> On 27.11.2015 15:59, Jim Jagielski wrote:
>>> On Nov 26, 2015, at 8:49 PM, Branko Čibej <br...@apache.org> wrote:
>>> 
>>> In any case — I don't think anyone over at dev@s.a.o would object to APR
>>> including those functions. We actually have a number of other, heh,
>>> improvements on APR that we could "donate"; we just never really got
>>> around to producing the necessary patches.
>> Yeah, svn is in the same situation as httpd. There are
>> some functions would "ideally" would exist in APR,
>> but APR doesn't move "fast enough" to allow that to
>> happen, so both projects start collecting APR-like
>> kruft after awhile...
>> 
>> It certainly would be nice if there was someway to address
>> that...
> 
> Uh, what I wrote is in no way intended to be a criticism of APR. Maybe
> if people who think APR isn't moving fast enough spent their time
> writing code here instead of writing mails about it, this "problem"
> would just vanish. At least, that's my understanding of how open source
> is supposed to work -- right, Jim? ;)

Gosh! You are right! Gee whiz, I haven't had any substantial code added
to APR since 1.5.1. Thanks for reminding me! I really feel completely
and utterly unworthy to comment on or criticize APR in any meaningful
way and I offer my heartfelt apologies to everyone on this thread
for wasting their time on a thread which started off as a suggestion
for a new function to be added to httpd but, I noted:

    I propose a ap_strncasecmp/ap_strcasecmp which we should use.
    Ideally, it would be in apr but no need to wait for that
    to happen :)

which, at least how I read it, implies code to be added to APR
in the ideal case. But that is beside the point, as Brane so
correctly says! Instead of writing emails about code to be
added, we should instead be writing the code itself, which,
of course, will be accepted as-is with no discussion whatsoever,
since, heck, that's kind of what's going on here, but as Brane
reminds us all, such a thing is a problem that will magically
disappear the more we write code!

I propose that no message be allowed on the dev@apr list unless
a 15-line patch or code contribution is attached. This will
solve the nasty problem! In fact, maybe we should just shut down
dev@apr since it encourages such unconstructive behavior as "writing
emails" when we instead should be heads-down cranking out code
that may, or may not (usually not) be added and used before the
end of the decade.

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 27.11.2015 15:59, Jim Jagielski wrote:
>> On Nov 26, 2015, at 8:49 PM, Branko Čibej <br...@apache.org> wrote:
>>
>> In any case — I don't think anyone over at dev@s.a.o would object to APR
>> including those functions. We actually have a number of other, heh,
>> improvements on APR that we could "donate"; we just never really got
>> around to producing the necessary patches.
> Yeah, svn is in the same situation as httpd. There are
> some functions would "ideally" would exist in APR,
> but APR doesn't move "fast enough" to allow that to
> happen, so both projects start collecting APR-like
> kruft after awhile...
>
> It certainly would be nice if there was someway to address
> that...

Uh, what I wrote is in no way intended to be a criticism of APR. Maybe
if people who think APR isn't moving fast enough spent their time
writing code here instead of writing mails about it, this "problem"
would just vanish. At least, that's my understanding of how open source
is supposed to work -- right, Jim? ;)

-- Brane

Re: apr_token_* conclusions

Posted by Jim Jagielski <ji...@jaguNET.com>.
> On Nov 26, 2015, at 8:49 PM, Branko Čibej <br...@apache.org> wrote:
> 
> In any case — I don't think anyone over at dev@s.a.o would object to APR
> including those functions. We actually have a number of other, heh,
> improvements on APR that we could "donate"; we just never really got
> around to producing the necessary patches.

Yeah, svn is in the same situation as httpd. There are
some functions that "ideally" would exist in APR,
but APR doesn't move "fast enough" to allow that to
happen, so both projects start collecting APR-like
cruft after a while...

It certainly would be nice if there was some way to address
that...

Re: apr_token_* conclusions

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Thu, Nov 26, 2015 at 7:49 PM, Branko Čibej <br...@apache.org> wrote:

> On 26.11.2015 22:55, William A Rowe Jr wrote:
> > On Nov 26, 2015 11:03 AM, "Branko Čibej" <br...@apache.org> wrote:
> >> On 26.11.2015 15:44, William A Rowe Jr wrote:
> >>> Better if I address this Q to svn folks at the APR project :)
> >>> On Nov 26, 2015 08:39, "William A Rowe Jr" <wr...@rowe-clan.net> wrote:
> >>>
> >>>> Sounds right... Actually a fusion between svn_cstring_* and several
> >>>> existing ap_ and apr_ functions would be useful.
> >>>>
> >>>> SVN folk, any objection to APR appropriating these API's?  20/20
> >>>> hindsight, is apr_cstring_ or shorter apr_cstr_ the way to go here?  You
> >>>> all had to use the thing so I trust your preferences.  Either expresses
> >>>> locale C in my mind, so they work for me.
> >> Note that the svn_cstring* functions have *nothing* whatsoever to do
> >> with the "C" locale; they manipulate nul-terminated "C" strings, that's all.
> >> svn_cstring_casecmp depends on svn_ctype_casecmp; the svn_ctype
> >> functions are expected to only work on the ASCII subset.
> >>
> >> -- Brane
> > Understood.
> >
> > Unlike svn we still support EBCDIC and so the use of the phrase 'ASCII' is
> > unnecessary confusing.
> >
> > The aliases C and POSIX both refer to the locale you describe.  Only ASCII
> > digits are recognised, only ASCII punctuation is honored, only ASCII alpha
> > are case-folded.
> >
> > Or the associated characters in the EBCDIC set.  All other byte values are
> > opaque.
> >
> > GCC deemed this important enough to add the g_ascii_str* gcc specific
> > extension functions.
> >
> > We are saying the same thing and reading, just using different semantics to
> > describe cstring.
>
> Well, not exactly; the svn_cstring_casecmp is the only function in that
> group that works as if it were always in the "C" locale. The others are
> are simply a convenience for managing variable-length nul-terminated
> strings. In Subversion, for example, their contents are usually encoded
> in UTF-8.
>

To clarify, in the "C" locale, UTF-8 is just fine.  Do the other functions
treat the opaque UTF-8 high-bit-set characters specially, or do they simply
treat them as individual distinct bytes?

If it is the latter, that conforms to the C/POSIX locale, but if they are
actually handled explicitly as UTF-8 sequences, that gets a little trickier.


> ASCII vs. EBCDIC (or any other single-byte encoding) is really only a
> matter of using different case folding and codepoint attribute tables
> (or equivalents; there's no reason the implementations have to be
> table-driven). More complex encodings are pretty much out of scope, IMO.
>

And we agree from an httpd perspective.  The issue is that we must handle
all ASCII (RFC-defined) sequences specifically and have no side-effects
that we weren't expecting, from a hardening perspective.

> In any case -- I don't think anyone over at dev@s.a.o would object to APR
> including those functions. We actually have a number of other, heh,
> improvements on APR that we could "donate"; we just never really got
> around to producing the necessary patches.
>

I hope as we start discussing 2.0 in more detail, that some of these come
through.  But I'm inclined not to wait and to begin forking this specific
API as something that httpd 2.next needs, and some future version of 2.4.x
may decide it must adopt.

Thanks for the insights,

Bill

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 26.11.2015 22:55, William A Rowe Jr wrote:
> On Nov 26, 2015 11:03 AM, "Branko Čibej" <br...@apache.org> wrote:
>> On 26.11.2015 15:44, William A Rowe Jr wrote:
>>> Better if I address this Q to svn folks at the APR project :)
>>> On Nov 26, 2015 08:39, "William A Rowe Jr" <wr...@rowe-clan.net> wrote:
>>>
>>>> Sounds right... Actually a fusion between svn_cstring_* and several
>>>> existing ap_ and apr_ functions would be useful.
>>>>
>>>> SVN folk, any objection to APR appropriating these API's?  20/20
>>>> hindsight, is apr_cstring_ or shorter apr_cstr_ the way to go here?  You
>>>> all had to use the thing so I trust your preferences.  Either expresses
>>>> locale C in my mind, so they work for me.
>> Note that the svn_cstring* functions have *nothing* whatsoever to do
>> with the "C" locale; they manipulate nul-terminated "C" strings, that's all.
>> svn_cstring_casecmp depends on svn_ctype_casecmp; the svn_ctype
>> functions are expected to only work on the ASCII subset.
>>
>> -- Brane
> Understood.
>
> Unlike svn we still support EBCDIC and so the use of the phrase 'ASCII' is
> unnecessary confusing.
>
> The aliases C and POSIX both refer to the locale you describe.  Only ASCII
> digits are recognised, only ASCII punctuation is honored, only ASCII alpha
> are case-folded.
>
> Or the associated characters in the EBCDIC set.  All other byte values are
> opaque.
>
> GCC deemed this important enough to add the g_ascii_str* gcc specific
> extension functions.
>
> We are saying the same thing and reading, just using different semantics to
> describe cstring.

Well, not exactly; the svn_cstring_casecmp is the only function in that
group that works as if it were always in the "C" locale. The others are
simply a convenience for managing variable-length nul-terminated
strings. In Subversion, for example, their contents are usually encoded
in UTF-8.

ASCII vs. EBCDIC (or any other single-byte encoding) is really only a
matter of using different case folding and codepoint attribute tables
(or equivalents; there's no reason the implementations have to be
table-driven). More complex encodings are pretty much out of scope, IMO.


In any case — I don't think anyone over at dev@s.a.o would object to APR
including those functions. We actually have a number of other, heh,
improvements on APR that we could "donate"; we just never really got
around to producing the necessary patches.


-- Brane

Re: apr_token_* conclusions

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Nov 26, 2015 11:03 AM, "Branko Čibej" <br...@apache.org> wrote:
>
> On 26.11.2015 15:44, William A Rowe Jr wrote:
> > Better if I address this Q to svn folks at the APR project :)
> > On Nov 26, 2015 08:39, "William A Rowe Jr" <wr...@rowe-clan.net> wrote:
> >
> >> Sounds right... Actually a fusion between svn_cstring_* and several
> >> existing ap_ and apr_ functions would be useful.
> >>
> >> SVN folk, any objection to APR appropriating these API's?  20/20
> >> hindsight, is apr_cstring_ or shorter apr_cstr_ the way to go here?  You
> >> all had to use the thing so I trust your preferences.  Either expresses
> >> locale C in my mind, so they work for me.
>
> Note that the svn_cstring* functions have *nothing* whatsoever to do
> with the "C" locale; they manipulate nul-terminated "C" strings, that's all.
>
> svn_cstring_casecmp depends on svn_ctype_casecmp; the svn_ctype
> functions are expected to only work on the ASCII subset.
>
> -- Brane

Understood.

Unlike svn we still support EBCDIC, and so the use of the phrase 'ASCII' is
unnecessarily confusing.

The aliases C and POSIX both refer to the locale you describe.  Only ASCII
digits are recognised, only ASCII punctuation is honored, only ASCII alpha
are case-folded.

Or the associated characters in the EBCDIC set.  All other byte values are
opaque.

GLib deemed this important enough to add the g_ascii_str* GLib-specific
functions.

We are saying and reading the same thing, just using different terminology
to describe cstring.
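
The "only ASCII alpha are case-folded, all other byte values are opaque"
behaviour described above can be sketched as follows. This is only an
illustration under stated assumptions: the names sketch_tolower and
sketch_fold_token are hypothetical (they are not APR, httpd or svn
functions), and the range test assumes an ASCII execution character set.
An EBCDIC build would need a lookup table instead, since A-Z are not
contiguous there.

```c
/* Hypothetical helper, not an APR API: fold only 'A'..'Z' to lower case.
 * Every other byte value, including high-bit bytes that may be part of a
 * UTF-8 sequence, is returned unchanged (it stays "opaque").
 * Assumes an ASCII execution character set. */
static unsigned char sketch_tolower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: fold a nul-terminated token in place, without
 * consulting the current locale at all. */
static void sketch_fold_token(char *s)
{
    for (; *s != '\0'; s++) {
        *s = (char)sketch_tolower((unsigned char)*s);
    }
}
```

Because the locale is never consulted, a later setlocale() call elsewhere
in the process cannot change the result, which seems to be the hardening
property the thread is after.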

Re: apr_token_* conclusions

Posted by Branko Čibej <br...@apache.org>.
On 26.11.2015 15:44, William A Rowe Jr wrote:
> Better if I address this Q to svn folks at the APR project :)
> On Nov 26, 2015 08:39, "William A Rowe Jr" <wr...@rowe-clan.net> wrote:
>
>> Sounds right... Actually a fusion between svn_cstring_* and several
>> existing ap_ and apr_ functions would be useful.
>>
>> SVN folk, any objection to APR appropriating these API's?  20/20
>> hindsight, is apr_cstring_ or shorter apr_cstr_ the way to go here?  You
>> all had to use the thing so I trust your preferences.  Either expresses
>> locale C in my mind, so they work for me.

Note that the svn_cstring* functions have *nothing* whatsoever to do
with the "C" locale; they manipulate nul-terminated "C" strings, that's all.

svn_cstring_casecmp depends on svn_ctype_casecmp; the svn_ctype
functions are expected to only work on the ASCII subset.

-- Brane


>> On Nov 26, 2015 07:38, "Jim Jagielski" <ji...@jagunet.com> wrote:
>>
>>> Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
>>> pretty much inline with my thoughts on how httpd would
>>> use ours...
>>>
>>>> On Nov 25, 2015, at 5:10 PM, Bert Huijben <be...@qqmail.nl> wrote:
>>>>
>>>> We have a set of similar comparison functions in Subversion. I’m pretty
>>>> sure we already had these in the time we still had ebcdic support on trunk.
>>>> (We removed that support years ago, but the code should still live on a
>>>> branch)
>>>> Bert
>>>>
>>>> From: William A Rowe Jr [mailto:wrowe@rowe-clan.net]
>>>> Sent: woensdag 25 november 2015 22:55
>>>> To: httpd <de...@httpd.apache.org>
>>>> Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
>>>>
>>>> On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:
>>>>> Hi,
>>>>>
>>>>> just in case off, gnome as a set of function g_ascii_...
>>>>> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
>>>> Interesting, does anyone know offhand whether these perform the expected
>>>> or the stated behavior under EBCDIC environments?
>>>


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
Better if I address this Q to svn folks at the APR project :)
On Nov 26, 2015 08:39, "William A Rowe Jr" <wr...@rowe-clan.net> wrote:

> Sounds right... Actually a fusion between svn_cstring_* and several
> existing ap_ and apr_ functions would be useful.
>
> SVN folk, any objection to APR appropriating these API's?  20/20
> hindsight, is apr_cstring_ or shorter apr_cstr_ the way to go here?  You
> all had to use the thing so I trust your preferences.  Either expresses
> locale C in my mind, so they work for me.
> On Nov 26, 2015 07:38, "Jim Jagielski" <ji...@jagunet.com> wrote:
>
>> Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
>> pretty much inline with my thoughts on how httpd would
>> use ours...
>>
>> > On Nov 25, 2015, at 5:10 PM, Bert Huijben <be...@qqmail.nl> wrote:
>> >
>> > We have a set of similar comparison functions in Subversion. I’m pretty
>> > sure we already had these in the time we still had ebcdic support on trunk.
>> > (We removed that support years ago, but the code should still live on a
>> > branch)
>> >
>> > Bert
>> >
>> > From: William A Rowe Jr [mailto:wrowe@rowe-clan.net]
>> > Sent: woensdag 25 november 2015 22:55
>> > To: httpd <de...@httpd.apache.org>
>> > Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
>> >
>> > On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:
>> >> Hi,
>> >>
>> >> just in case off, gnome as a set of function g_ascii_...
>> >> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
>> >
>> > Interesting, does anyone know offhand whether these perform the expected
>> > or the stated behavior under EBCDIC environments?
>>
>>

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
Sounds right... Actually a fusion between svn_cstring_* and several
existing ap_ and apr_ functions would be useful.

SVN folk, any objection to APR appropriating these API's?  20/20 hindsight,
is apr_cstring_ or shorter apr_cstr_ the way to go here?  You all had to
use the thing so I trust your preferences.  Either expresses locale C in my
mind, so they work for me.
On Nov 26, 2015 07:38, "Jim Jagielski" <ji...@jagunet.com> wrote:

> Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
> pretty much inline with my thoughts on how httpd would
> use ours...
>
> > On Nov 25, 2015, at 5:10 PM, Bert Huijben <be...@qqmail.nl> wrote:
> >
> > We have a set of similar comparison functions in Subversion. I’m pretty
> > sure we already had these in the time we still had ebcdic support on trunk.
> > (We removed that support years ago, but the code should still live on a
> > branch)
> >
> > Bert
> >
> > From: William A Rowe Jr [mailto:wrowe@rowe-clan.net]
> > Sent: woensdag 25 november 2015 22:55
> > To: httpd <de...@httpd.apache.org>
> > Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
> >
> > On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:
> >> Hi,
> >>
> >> just in case off, gnome as a set of function g_ascii_...
> >> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
> >
> > Interesting, does anyone know offhand whether these perform the expected
> > or the stated behavior under EBCDIC environments?
>
>

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
pretty much inline with my thoughts on how httpd would
use ours...

> On Nov 25, 2015, at 5:10 PM, Bert Huijben <be...@qqmail.nl> wrote:
> 
> We have a set of similar comparison functions in Subversion. I’m pretty sure we already had these in the time we still had ebcdic support on trunk.
> (We removed that support years ago, but the code should still live on a branch)
>  
> Bert
>  
> From: William A Rowe Jr [mailto:wrowe@rowe-clan.net] 
> Sent: woensdag 25 november 2015 22:55
> To: httpd <de...@httpd.apache.org>
> Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
>  
> On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <ch...@wanadoo.fr> wrote:
>> Hi,
>> 
>> just in case off, gnome as a set of function g_ascii_...
>> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
>  
> Interesting, does anyone know offhand whether these perform the expected
> or the stated behavior under EBCDIC environments? 


RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Bert Huijben <be...@qqmail.nl>.
We have a set of similar comparison functions in Subversion. I’m pretty sure we already had these in the time we still had ebcdic support on trunk.

(We removed that support years ago, but the code should still live on a branch)

Bert

From: William A Rowe Jr [mailto:wrowe@rowe-clan.net]
Sent: Wednesday, 25 November 2015 22:55
To: httpd <de...@httpd.apache.org>
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> Hi,
>
> just in case off, gnome as a set of function g_ascii_...
> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)

Interesting, does anyone know offhand whether these perform the expected
or the stated behavior under EBCDIC environments?

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <christophe.jaillet@wanadoo.fr> wrote:

> Hi,
>
> just in case off, gnome as a set of function g_ascii_...
> (see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)


Interesting, does anyone know offhand whether these perform the expected
or the stated behavior under EBCDIC environments?

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Christophe JAILLET <ch...@wanadoo.fr>.
Hi,

just in case it is useful: GNOME (GLib) has a set of g_ascii_... functions
(see https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)

>
> I'm also waiting for feedback about the naming convention, I'd like to get
> this into APR yesterday and start building on it, but it's hard to name our
> generic-posix tolower/toupper until we agree on the naming scheme :)
>
>


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Christophe JAILLET <ch...@wanadoo.fr>.
Le 25/11/2015 22:02, Jim Jagielski a écrit :
> In general, strcmp() is not implemented via strcmp.c
> (although if you do a source code search for strcmp, that's
> what you'll get). Most of the time it's implemented in
> assembly (strcmp.s) or simply leverages memcmp() where
> you aren't doing a byte by byte comparison but are doing
> a native memory word (32 or 64bit) comparison. This
> makes them super fast.
>
> Once we need to worry about case insensitivity, then
> we see a whole gamut of implementations; some use
> a mapped array as I did; some go char by char and call
> tolower() on each one; some do other things such as
> testing if isupper() before calling tolower() if needed.
> The word-based optimizations seem less viable, as seen
> in test results that I ran and Yann also verified (afaict)
>
> In my tests, my impl was faster on OSX and CentOS5 and 6.
> It's a very common function we use and with my test results
> it seemed to make sense to provide our own impl, esp if
> we decided that what we were really concerned about was
> comparing for equality, and so would be able to avoid
> the !strcasecmp logic leaping.
>
> If we decide that all this was for moot, that's fine.
> That's what these types of investigations and discussions
> are for.
>

Personally, my testing shows that faster/slower is not that self-evident.
On my machine, it depends on the length of the string.
With shorter strings (less than ~10 chars) Yann's proposal seems to be
the best with the test program. What happens if the const char table is
not in L1 cache? Do we still have the same speedup?
When strings are longer, the standard strncasecmp always wins.

Short strings are our use case, so, I would say, why not use this
implementation, after all?


My personal reservations would be:
    - it adds complexity to the code (one more function that looks
      really similar to existing ones)
    - the speed increase is 'only' 15%, if I remember correctly the latest
      numbers given by Yann
    - the speed increase is potentially platform/compiler/C library
      dependent.
    - it does not remove (IMO) the 'switch' for going even faster to
      the right test
    - many of the tests against ASCII strings are hidden in apr
      functions (apr_table_get...)
Do we have an idea of the overall time spent in these str[n]casecmp
functions when processing a request?  15% of that time should be, IMO,
quite low.
Is it worth the added complexity? For me, the answer is: not sure.
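
For readers unfamiliar with the 'switch' mentioned in the list above: as
far as can be told from context, it refers to dispatching on the first
character of a token so that only a few full comparisons are attempted.
The sketch below illustrates that pattern only; the enum, the function
names and the choice of header names are all hypothetical and are not
taken from httpd or APR. It assumes an ASCII execution character set.

```c
/* Hypothetical classification result; not an httpd type. */
typedef enum {
    SKETCH_HDR_UNKNOWN,
    SKETCH_HDR_HOST,
    SKETCH_HDR_CONTENT_LENGTH,
    SKETCH_HDR_CONTENT_TYPE
} sketch_hdr;

/* Fold 'A'..'Z' only; every other value is returned unchanged. */
static int sketch_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Returns 1 when name equals lower_token, ignoring ASCII case in name.
 * lower_token must already be lower case. */
static int sketch_token_eq(const char *name, const char *lower_token)
{
    while (*lower_token != '\0') {
        if (sketch_lower((unsigned char)*name) != (unsigned char)*lower_token) {
            return 0;
        }
        name++;
        lower_token++;
    }
    return *name == '\0';
}

/* Dispatch on the first character, then run at most a couple of full
 * comparisons instead of one per known header. */
sketch_hdr sketch_classify_header(const char *name)
{
    switch (sketch_lower((unsigned char)name[0])) {
    case 'h':
        if (sketch_token_eq(name, "host")) return SKETCH_HDR_HOST;
        break;
    case 'c':
        if (sketch_token_eq(name, "content-length")) return SKETCH_HDR_CONTENT_LENGTH;
        if (sketch_token_eq(name, "content-type")) return SKETCH_HDR_CONTENT_TYPE;
        break;
    default:
        break;
    }
    return SKETCH_HDR_UNKNOWN;
}
```

A faster comparison function speeds up each individual test but, as the
list item says, it does not replace this kind of dispatch: the switch
reduces how many comparisons run in the first place.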

CJ

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
In general, strcmp() is not implemented via strcmp.c
(although if you do a source code search for strcmp, that's
what you'll get). Most of the time it's implemented in
assembly (strcmp.s) or simply leverages memcmp() where
you aren't doing a byte by byte comparison but are doing
a native memory word (32 or 64bit) comparison. This
makes them super fast.

Once we need to worry about case insensitivity, then
we see a whole gamut of implementations; some use
a mapped array as I did; some go char by char and call
tolower() on each one; some do other things such as
testing if isupper() before calling tolower() if needed.
The word-based optimizations seem less viable, as seen
in test results that I ran and Yann also verified (afaict).

In my tests, my impl was faster on OSX and CentOS5 and 6.
It's a very common function we use and with my test results
it seemed to make sense to provide our own impl, esp if
we decided that what we were really concerned about was
comparing for equality, and so would be able to avoid
the !strcasecmp logic leaping.

If we decide that all this was moot, that's fine.
That's what these types of investigations and discussions
are for.
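
A minimal sketch of the "mapped array" approach described above, for
readers who have not seen the proposed patch. It is not the code that was
benchmarked: the names are hypothetical, the table is filled lazily at run
time for brevity (a real implementation would more likely use a static
const table, and this lazy initialisation is not thread-safe), and it
assumes an ASCII execution character set.

```c
#include <stddef.h>

/* Hypothetical folding table: identity for every byte except 'A'..'Z'. */
static unsigned char sketch_fold[256];
static int sketch_fold_ready = 0;

static void sketch_fold_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        sketch_fold[i] = (unsigned char)i;
    }
    for (i = 'A'; i <= 'Z'; i++) {
        sketch_fold[i] = (unsigned char)(i + ('a' - 'A'));
    }
    sketch_fold_ready = 1;
}

/* Locale-independent, case-insensitive compare of nul-terminated strings.
 * Returns <0, 0 or >0 like strcasecmp(). */
int sketch_token_strcasecmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    if (!sketch_fold_ready) sketch_fold_init();
    for (;;) {
        int c1 = sketch_fold[*p1];
        int c2 = sketch_fold[*p2];
        if (c1 != c2) return c1 - c2;
        if (c1 == 0) return 0;      /* both strings ended together */
        p1++;
        p2++;
    }
}

/* Same comparison, but examines at most n bytes. */
int sketch_token_strncasecmp(const char *s1, const char *s2, size_t n)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    if (!sketch_fold_ready) sketch_fold_init();
    while (n-- > 0) {
        int c1 = sketch_fold[*p1];
        int c2 = sketch_fold[*p2];
        if (c1 != c2) return c1 - c2;
        if (c1 == 0) break;
        p1++;
        p2++;
    }
    return 0;
}
```

Callers that only care about equality would test the result against zero,
which is the "!strcasecmp" idiom the message mentions wanting to avoid
with a dedicated equality function.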

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 1:50 PM, Jim Jagielski <ji...@jagunet.com> wrote:

> My point is that we use it to compare, for example,
> "FoobARski!" with "foOBArsKi!", not "Ébana?" with "ébana?" or "ebana?"
>
> In that way I mean "ascii"
>

But that isn't precisely what you wrote.  It happens to be ASCII here
because we are corresponding in some offshoot of ISO646 (and yours might
be different than mine, but gmail resolved it).  We have an EBCDIC modality
for APR and for httpd, and in that case, it is exactly what you meant
(only A-Z, a-z) but not what you wrote :)


> Heck, we may as well say that we really aren't comparing
> "strings" at all, just arrays of 8bit characters :)
>

True, almost.  That's why I reiterated that the 8-bit values are largely
opaque to the "C" locale.  As long as A-Z, a-z behave as 'we' expect
with protocol conformance, we aren't so worried about the rest, and
that applies equally in ASCII or EBCDIC or Baudot.

> Anyway, that was my final post about the name... at this
> point I'd just like to see the actual improvement get completely
> folded in and used so we (and our users) can start enjoying the
> benefit.
>

In terms of your perceived optimization, I'd hate to have the result that
OS/X hobbled users gain a faster strcmp implementation while others realize
a slower implementation, and I'm thinking of non-locale aware BSD.  That's
often the case with "optimizations" that don't consider what the clib
maintainers are able to accomplish with specialized knowledge of the
target architecture (especially MMX operations across arrays of characters).

And we long ago decided that APR really isn't cut out for those sorts of
optimizations.

I'm also waiting for feedback about the naming convention, I'd like to get
this into APR yesterday and start building on it, but it's hard to name our
generic-posix tolower/toupper until we agree on the naming scheme :)

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jacob Champion <ch...@gmail.com>.
On Nov 25, 2015 1:10 PM, "Jim Jagielski" <ji...@jagunet.com> wrote:
> ... I think we are WAY overthinking naming here.

I overthink naming constantly, so there's an excellent chance that you're
absolutely correct! That said... your list only ended up convincing me that
APR needs better naming conventions. ;-D

(I do really appreciate this discussion, though. I promise I'm not trying
to stir the pot.)

--Jacob

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 3:10 PM, Jim Jagielski <ji...@jagunet.com> wrote:

> In a library that has:
>
>         apr_pstrdup()
>         apr_pstrndup()
>         apr_pstrmemdup()
>

which are all semantically and mechanically different...


> and apr_pstrmemdup() and apr_pstrndup() are functionally
> the same,


Are you arguing to remove pstrmemdup?  That's a discussion to have
before APR 2.0, certainly, but it isn't functionally the same; bytes within
the copied pstrmemdup may be null, and it has a trailing null appended,
quite different than a memdup.


> as well as:
>
>         apr_strnatcasecmp()
>         apr_strnatcmp()
>
> neither of which use an 'n' variable to determine string
> size,


So there isn't a strnnat[case]cmp() function, are you offering a patch?


> yet is called 'strn...'


Indeed, possible to trip over with grep, for sure, but what is an 'atcmp'?
Seems clear enough to me, but are you proposing a rename?  APR 2.0
is the right time for that...


> whereas the dups use that
> 'n' in 'strndup' to signify that we have a size parameter
>

Indeed... follows the general stdc pattern.


> BUT its functionally equiv function apr_pstrmemdup() is
> called what it is instead of apr_pstrnmemdup()...
>
> ... I think we are WAY overthinking naming here.
>

People may be overthinking, and stumbling to come up with the
most concise and accurate name.  Renaming suggestions and
deprecation of the old names are welcome.  These are good
discussions to have, we made many improvements between
APR 0.9.x and APR 1.0.0 for exactly these reasons.

I agree we can call your proposal apr_str[n]casecmp because it
is a str[n]casecmp implementation - however, that doesn't tell the
user that it is an "unusual" but equivalent function that breaks from
posix in that it deliberately chooses not to use the locale and is
primarily for wire protocols.  Thus the _token_ suggestion, but
I am open to other uniquifiers.  I'm not keen on coming up with
a new mishmash of str case len cmp equality blah that will be
harder for reviewers to decipher when reviewing commits.

I know you are in a hurry to just do something, but usually the stuff
we just hurry through many of us regret later, as piles of httpd.h cruft
can attest.  How many headers do you know that contain explicit
sighs?  APR has attempted to be more deliberate in its naming
conventions, by consensus.

You've certainly voiced your ire many times at APR's unwillingness
to just modify an API within a major.minor revision, expressed very
little confidence that waiting for an APR release is ever a good idea,
and might even be perceived as hostile toward the entire APR
approach - which has never offered the shoot-from-the-hip approach
that earlier httpd releases enjoyed.  But these decisions were put
down in reaction to frequent breakage for developers prior to httpd 2,
and are in place precisely because APR wants other developers
beyond the world of httpd to have trust and confidence in the API
they are coding to.  Hopefully httpd module authors can enjoy the
same level of confidence.
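
The difference between the 'n' and 'mem' duplicators discussed above can
be sketched with plain malloc() stand-ins. These are hypothetical helpers
written for illustration, not the APR functions themselves (which allocate
from an apr_pool_t); they mirror the behaviour as described in this
message: the strndup-style copy stops at the first NUL, while the
strmemdup-style copy always takes exactly n bytes and then appends a NUL.

```c
#include <stdlib.h>
#include <string.h>

/* strndup-style: copy at most n bytes, stopping early at the first NUL.
 * Caller frees the result; returns NULL on allocation failure. */
char *sketch_strndup(const char *s, size_t n)
{
    const char *end = memchr(s, '\0', n);
    size_t len = (end != NULL) ? (size_t)(end - s) : n;
    char *res = malloc(len + 1);

    if (res == NULL) return NULL;
    memcpy(res, s, len);
    res[len] = '\0';
    return res;
}

/* strmemdup-style: copy exactly n bytes (embedded NULs included) and
 * append a trailing NUL. The caller must guarantee s has n readable bytes. */
char *sketch_strmemdup(const char *s, size_t n)
{
    char *res = malloc(n + 1);

    if (res == NULL) return NULL;
    memcpy(res, s, n);
    res[n] = '\0';
    return res;
}
```

Given the five bytes 'a' 'b' NUL 'c' 'd', the first helper yields the
two-character string "ab", while the second preserves all five bytes and
terminates them, so the two are interchangeable only when the input has no
embedded NUL within the first n bytes.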

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 3:10 PM, Jim Jagielski <ji...@jagunet.com> wrote:

> In a library that has:
>
>         apr_pstrdup()
>         apr_pstrndup()
>         apr_pstrmemdup()
>

which are all semantically and mechanically different...


> and apr_pstrmemdup() and apr_pstrndup() are functionally
> the same,


Are you arguing to remove pstrmemdup?  That's a discussion to have
before APR 2.0, certainly, but it isn't functionally the same; bytes within
the copied pstrmemdup may be null, and it has a trailing null appended,
quite different than a memdup.


> as well as:
>
>         apr_strnatcasecmp()
>         apr_strnatcmp()
>
> neither of which use an 'n' variable to determine string
> size,


So there isn't a strnnat[case]cmp() function, are you offering a patch?


> yet is called 'strn...'


Indeed, possible to trip over with grep, for sure, but what is an 'atcmp'?
Seems clear enough to me, but are you proposing a rename?  APR 2.0
is the right time for that...


> whereas the dups use that
> 'n' in 'strndup' to signify that we have a size parameter
>

Indeed... follows the general stdc pattern.


> BUT its functionally equiv function apr_pstrmemdup() is
> called what it is instead of apr_pstrnmemdup()...
>
> ... I think we are WAY overthinking naming here.
>

People may be overthinking, and stumbling to come up with the
most concise and accurate name.  Renaming suggestions and
deprecation of the old names are welcome.  These are good
discussions to have; we made many improvements between
APR 0.9.x and APR 1.0.0 for exactly these reasons.

I agree we can call your proposal apr_str[n]casecmp because it
is a str[n]casecmp implementation - however, that doesn't tell the
user that it is an "unusual" but equivalent function that breaks from
posix in that it deliberately chooses not to use the locale and is
primarily for wire protocols.  Thus the _token_ suggestion, but
I am open to other uniquifiers.  I'm not keen on coming up with
a new mishmash of str case len cmp equality blah that will be
harder for reviewers to decipher when reviewing commits.
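
As a rough illustration of what "deliberately chooses not to use the
locale" means, here is a minimal sketch.  It is not the proposed patch:
the names sketch_strcasecmp() and sketch_strncasecmp() are hypothetical,
and the folding is done by looking letters up in two literal alphabets
(an assumption chosen so that the same source folds only A-Z whether
the execution character set is ASCII or EBCDIC).  No call consults
setlocale() state, and all other octets keep their identity.

```c
#include <stddef.h>
#include <string.h>

static const char sketch_upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char sketch_lower[] = "abcdefghijklmnopqrstuvwxyz";

/* Fold only the 26 unaccented letters; every other octet, including
 * high-bit bytes, is returned unchanged. */
static unsigned char sketch_fold(unsigned char c)
{
    const char *p = (c != '\0') ? strchr(sketch_upper, c) : NULL;

    return p ? (unsigned char)sketch_lower[p - sketch_upper] : c;
}

/* Locale-independent, case-insensitive comparison of two strings. */
int sketch_strcasecmp(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    for (;;) {
        unsigned char ca = sketch_fold(*ua++);
        unsigned char cb = sketch_fold(*ub++);

        if (ca != cb)
            return (int)ca - (int)cb;
        if (ca == '\0')
            return 0;
    }
}

/* Same comparison, limited to at most n characters. */
int sketch_strncasecmp(const char *a, const char *b, size_t n)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    while (n-- > 0) {
        unsigned char ca = sketch_fold(*ua++);
        unsigned char cb = sketch_fold(*ub++);

        if (ca != cb)
            return (int)ca - (int)cb;
        if (ca == '\0')
            break;
    }
    return 0;
}
```

A real implementation would more likely use a precomputed 256-entry
table for speed; the per-character lookup here only keeps the sketch
short.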

I know you are in a hurry to just do something, but the stuff we
hurry through is usually what many of us regret later, as piles of
httpd.h cruft can attest.  How many headers do you know that contain
explicit sighs?  APR has attempted to be more deliberate in its naming
conventions, by consensus.

You've certainly voiced your ire many times at APR's unwillingness
to just modify an API within a major.minor revision, expressed very
little confidence that waiting for an APR release is ever a good idea,
and might even be perceived as hostile toward the entire APR
approach - which has never offered the shoot-from-the-hip approach
that earlier httpd releases enjoyed.  But these decisions were put
down in reaction to frequent breakage for developers prior to httpd 2,
and are in place precisely because APR wants other developers
beyond the world of httpd to have trust and confidence in the API
they are coding to.  Hopefully httpd module authors can enjoy the
same level of confidence.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
In a library that has:

	apr_pstrdup()
	apr_pstrndup()
	apr_pstrmemdup()

and apr_pstrmemdup() and apr_pstrndup() are functionally
the same, as well as:

	apr_strnatcasecmp()
	apr_strnatcmp()

neither of which use an 'n' variable to determine string
size, yet is called 'strn...' whereas the dups use that
'n' in 'strndup' to signify that we have a size parameter
BUT its functionally equiv function apr_pstrmemdup() is
called what it is instead of apr_pstrnmemdup()...

... I think we are WAY overthinking naming here.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 2:06 PM, Jacob Champion <ch...@gmail.com>
wrote:

> My two cents: I agree that another "name mangled" abbreviation is not
> particularly helpful, but I also agree with Jim's concern: "apr_token" made
> me immediately wonder what made this exclusive to HTTP tokens.
> Unfortunately I don't have much of an alternative suggestion. I have seen
> other frameworks refer to an "invariant" or "independent" locale/culture
> before; maybe that helps jog someone's creativity?
>
> Feel free to ignore my rambling. ;) My default naming strategy is to throw
> out random ideas.
>

I was thinking "_lcc_" for locale-C, short and to the point, but meaningful?

"_lcposix_" is also descriptive but long, and "_posix_" is much too
overloaded with multiple meanings and implications (locales *are* a
posix API :)

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jacob Champion <ch...@gmail.com>.
My two cents: I agree that another "name mangled" abbreviation is not
particularly helpful, but I also agree with Jim's concern: "apr_token" made
me immediately wonder what made this exclusive to HTTP tokens.
Unfortunately I don't have much of an alternative suggestion. I have seen
other frameworks refer to an "invariant" or "independent" locale/culture
before; maybe that helps jog someone's creativity?

Feel free to ignore my rambling. ;) My default naming strategy is to throw
out random ideas.

--Jacob
(on mobile, apologies for strange formatting)

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
My point is that we use it to compare, for example,
"FoobARski!" with "foOBArsKi!", not "Ébana?" with "ébana?" or "ebana?"

In that way I mean "ascii"

Heck, we may as well say that we really aren't comparing
"strings" at all, just arrays of 8bit characters :)

Anyway, that was my final post about the name... at this
point I'd just like to see the actual improvement get completely
folded in and used so we (and our users) can start enjoying the
benefit.

> On Nov 25, 2015, at 2:31 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> 
> On Wed, Nov 25, 2015 at 1:12 PM, Jim Jagielski <ji...@jagunet.com> wrote:
> 
> > On Nov 25, 2015, at 12:42 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> >
> > On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <ji...@jagunet.com> wrote:
> > What is the current status? Is this on hold?
> >
> > It is looking for a good name.  I'm happy with apr_token_strcasecmp
> > to best indicate its use-case and provenance.  Does that work for
> > everyone?
> 
> Still not super excited by the use of 'token' since it
> implies it should only be used for HTTP tokens and not
> in other cases where we use it to do ascii string comparisons
> (for example, when we check env-var settings or maybe directives)...
> yeah, they could also be lumped as 'tokens' I guess...
> 
> ap_casecmpastr[n] for Case-insensitive CoMParison of Ascii STRing
> 
> APR has a naming pattern for various functional groups - this won't be the last
> one that is impacted by POSIX-ing what should already be posix :)
> 
> Because this is (a) str[n]casecmp I'm pretty strongly against name mangling
> for the sake of name mangling, our consumers are C programmers, after all.
> Well, most of them anyways... and they should be familiar enough names
> for the Lua and PHP folks too.
> 
> And this isn't ASCII actually, we established that we want EBCDIC build of
> APR + HTTPD to have the same thing.  Not ASCII, but POSIX locale.  We
> will be careful about the description on that count.
> 
> Still -0.5 on introducing an ap_function, in light of the current mess in httpd.h.
> I'm only 10% of the way through reviewing @deprecated on that single header.
> 
> 


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by William A Rowe Jr <wr...@rowe-clan.net>.
On Wed, Nov 25, 2015 at 1:12 PM, Jim Jagielski <ji...@jagunet.com> wrote:

>
> > On Nov 25, 2015, at 12:42 PM, William A Rowe Jr <wr...@rowe-clan.net>
> wrote:
> >
> > On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <ji...@jagunet.com> wrote:
> > What is the current status? Is this on hold?
> >
> > It is looking for a good name.  I'm happy with apr_token_strcasecmp
> > to best indicate its use-case and provenance.  Does that work for
> > everyone?
>
> Still not super excited by the use of 'token' since it
> implies it should only be used for HTTP tokens and not
> in other cases where we use it to do ascii string comparisons
> (for example, when we check env-var settings or maybe directives)...
> yeah, they could also be lumped as 'tokens' I guess...
>
> ap_casecmpastr[n] for Case-insensitive CoMParison of Ascii STRing
>

APR has a naming pattern for various functional groups - this won't be
the last one that is impacted by POSIX-ing what should already be posix :)

Because this is (a) str[n]casecmp I'm pretty strongly against name mangling
for the sake of name mangling, our consumers are C programmers, after all.
Well, most of them anyways... and they should be familiar enough names
for the Lua and PHP folks too.

And this isn't ASCII actually; we established that we want an EBCDIC build
of APR + HTTPD to have the same thing.  Not ASCII, but POSIX locale.  We
will be careful about the description on that count.

Still -0.5 on introducing an ap_function, in light of the current mess in
httpd.h.  I'm only 10% of the way through reviewing @deprecated on that
single header.

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

Posted by Jim Jagielski <ji...@jaguNET.com>.
> On Nov 25, 2015, at 12:42 PM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> 
> On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski <ji...@jagunet.com> wrote:
> What is the current status? Is this on hold?
> 
> It is looking for a good name.  I'm happy with apr_token_strcasecmp
> to best indicate its use-case and provenance.  Does that work for 
> everyone?

Still not super excited by the use of 'token' since it
implies it should only be used for HTTP tokens and not
in other cases where we use it to do ascii string comparisons
(for example, when we check env-var settings or maybe directives)...
yeah, they could also be lumped as 'tokens' I guess...

ap_casecmpastr[n] for Case-insensitive CoMParison of Ascii STRing