You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apr.apache.org by Joe Orton <jo...@manyfish.co.uk> on 2004/03/19 02:17:58 UTC
nasty apr-util xlate bug
I was investigating one AIX problem and came across a different one:
http://www.contactor.se/~dast/svn/archive-2003-03/1565.shtml
the cause seems to be the check_sbcs function in xlate.c.
The convset->ich is set up to convert from UTF-8 to ISO-8895-1;
check_sbcs fails to create the sbcs table for this case, of course. But
the iconv state is not reset, and it was by design left part-way through
a UTF-8 sequence: so when a subsequent conv_buffer calls come along to
reuse iconv->ich, expecting it to be in the initial state, it all goes
horribly wrong.
The iconv(, NULL, NULL, NULL, NULL)-type calls to reset to the initial
conversion state segfaulted on AIX so it seems necessary to recreate the
cd, as per below... the same problem exists in the apr_iconv-based
implementation I guess.
Am I missing anything? Not sure why this problem doesn't affect more
users.
--- xlate/xlate.c 13 Feb 2004 09:55:27 -0000 1.19
+++ xlate/xlate.c 19 Mar 2004 01:06:48 -0000
@@ -132,6 +132,11 @@
convset->ich = (iconv_t)-1;
/* TODO: add the table to the cache */
+ } else {
+ /* reopen the iconv cd to get back to initial conversion
+ * state */
+ iconv_close(convset->ich);
+ convset->ich = iconv_open(convset->topage, convset->frompage);
}
}
#elif APU_HAVE_APR_ICONV
Re: nasty apr-util xlate bug
Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:
> Can APR just hard-code a couple of EBCDIC<->something translation tables
> (#ifdef I_AM_EBCDIC) to solve these problems? I'd be suprised if the
> iconv() adds that much overhead vs the translation table, if you're
> going to do the iconv_open() anyway.
I won't worry about that for now; maybe Apache (or some other app that
frequently does converstion of protocol data) needs to do something different
if it turns out to be too slow
> Anyway, you don't disagree that the current code "single-byte
> translation detection" code is unsafe and needs to be removed?
delete the internal detection/use of sbcs table at will, but I'm not ready to
give up on the convert-single-byte API, which is quite handly (a simplifying
factor for the app) in some circumstances
Re: nasty apr-util xlate bug
Posted by Joe Orton <jo...@manyfish.co.uk>.
On Mon, Mar 22, 2004 at 09:45:20AM -0500, Jeff Trawick wrote:
> Joe Orton wrote:
>
> >So it seems no optimisation is really possible here; what performance
> >problem is this code solving? The overhead of iconv surely in
> >iconv_open() anyway: I think the right thing to do is remove the
> >check_sbcs() functions.
>
> On z/OS, where charset translation is a key feature of most network apps,
> the applications I'm familar with used their own tables for fast lookup
> when that was safe, and used iconv() otherwise. I did not do my own
> performance analysis to see what the real benefits are, but simply assumed
> this was valuable for some real world scenarios where APR is used, such as
> Apache on z/OS where lots of data will be translated. So I put the local
> table capability down in APR so the app wouldn't have to worry about it.
> But that has an issue: APR really doesn't know one charset from another,
> unlike many apps which constantly deal with
> some-EBCDIC-codepage<->US-ASCII, so APR doesn't inherently know when a
> translation is always single-byte and needs to figure it out if it is going
> to own the optimization task.
Can APR just hard-code a couple of EBCDIC<->something translation tables
(#ifdef I_AM_EBCDIC) to solve these problems? I'd be suprised if the
iconv() adds that much overhead vs the translation table, if you're
going to do the iconv_open() anyway.
Anyway, you don't disagree that the current code "single-byte
translation detection" code is unsafe and needs to be removed?
> There is also an API simplicity issue. In some code situations it is nice
> to have an API that is as simple as an array access (see
> apr_xlate_conv_byte()), but this makes sense only when it is known that the
> translation is single-byte. Apache can use this in some common situations
> where it is not too much overhead, but it would probably want to maintain
> its own tables for protocol data conversion in lieu of having to set up
> calls to the more complicated apr_xlate_conv_buffer() all the time.
Re: nasty apr-util xlate bug
Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:
> So it seems no optimisation is really possible here; what performance
> problem is this code solving? The overhead of iconv surely in
> iconv_open() anyway: I think the right thing to do is remove the
> check_sbcs() functions.
On z/OS, where charset translation is a key feature of most network apps, the
applications I'm familar with used their own tables for fast lookup when that
was safe, and used iconv() otherwise. I did not do my own performance analysis
to see what the real benefits are, but simply assumed this was valuable for
some real world scenarios where APR is used, such as Apache on z/OS where lots
of data will be translated. So I put the local table capability down in APR so
the app wouldn't have to worry about it. But that has an issue: APR really
doesn't know one charset from another, unlike many apps which constantly deal
with some-EBCDIC-codepage<->US-ASCII, so APR doesn't inherently know when a
translation is always single-byte and needs to figure it out if it is going to
own the optimization task.
There is also an API simplicity issue. In some code situations it is nice to
have an API that is as simple as an array access (see apr_xlate_conv_byte()),
but this makes sense only when it is known that the translation is single-byte.
Apache can use this in some common situations where it is not too much
overhead, but it would probably want to maintain its own tables for protocol
data conversion in lieu of having to set up calls to the more complicated
apr_xlate_conv_buffer() all the time.
Re: nasty apr-util xlate bug
Posted by Joe Orton <jo...@manyfish.co.uk>.
On Fri, Mar 19, 2004 at 05:08:00PM -0500, Jeff Trawick wrote:
> Joe Orton wrote:
> >Hmm. Is this sbcs thing really safe at all? Just because a character
> >set translation gives a particular mapping for 0x00-0xff in that order
> >why is it guaranteed that it will for any other ordering of bytes?
> >
> >e.g. invent a mapping which does "0xff <end>" -> "0xff" but "0xff 0xf1"
> >-> "0x42". I'd be surprised if a mapping between two real charsets does
> >*not* exist which does something like this, given the range of extremely
> >weird and wonderful charsets out there.
>
> to rule out this issue completely, 256 calls to iconv() would be required
> in check_sbcs() to test each proposed byte/char individually ;)
My example is a bit bogus since iconv presumably can't handle a charset
where "0xff" and "0xff 0xf1" represent different characters, but the
point is there. Yes, logically, it would be necessary to do 256
individual iconv() calls to accurately deduce a single-byte lookup table
for a particular mapping if one is available.
So it seems no optimisation is really possible here; what performance
problem is this code solving? The overhead of iconv surely in
iconv_open() anyway: I think the right thing to do is remove the
check_sbcs() functions.
joe
Re: nasty apr-util xlate bug
Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:
> Hmm. Is this sbcs thing really safe at all? Just because a character
> set translation gives a particular mapping for 0x00-0xff in that order
> why is it guaranteed that it will for any other ordering of bytes?
>
> e.g. invent a mapping which does "0xff <end>" -> "0xff" but "0xff 0xf1"
> -> "0x42". I'd be surprised if a mapping between two real charsets does
> *not* exist which does something like this, given the range of extremely
> weird and wonderful charsets out there.
to rule out this issue completely, 256 calls to iconv() would be required in
check_sbcs() to test each proposed byte/char individually ;)
or maybe do two tests... one with the byte values incrementing and one with
the byte values decrementing... that would seem to drastically reduce the chances
not nice either way
how else to tell that this is simple table?
Re: nasty apr-util xlate bug
Posted by Joe Orton <jo...@manyfish.co.uk>.
Hmm. Is this sbcs thing really safe at all? Just because a character
set translation gives a particular mapping for 0x00-0xff in that order
why is it guaranteed that it will for any other ordering of bytes?
e.g. invent a mapping which does "0xff <end>" -> "0xff" but "0xff 0xf1"
-> "0x42". I'd be surprised if a mapping between two real charsets does
*not* exist which does something like this, given the range of extremely
weird and wonderful charsets out there.
Re: nasty apr-util xlate bug
Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:
> The convset->ich is set up to convert from UTF-8 to ISO-8895-1;
> check_sbcs fails to create the sbcs table for this case, of course. But
> the iconv state is not reset, and it was by design left part-way through
> a UTF-8 sequence: so when a subsequent conv_buffer calls come along to
> reuse iconv->ich, expecting it to be in the initial state, it all goes
> horribly wrong.
make sense, says the perpetrator
> The iconv(, NULL, NULL, NULL, NULL)-type calls to reset to the initial
> conversion state segfaulted on AIX so it seems necessary to recreate the
> cd, as per below... the same problem exists in the apr_iconv-based
> implementation I guess.
shrug
> Am I missing anything? Not sure why this problem doesn't affect more
> users.
ditto... while most of my use of it has been with the simple conversions, I
remember playing with mod_charset_lite and this code on Solaris 8 with
multi-byte characters