You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apr.apache.org by Joe Orton <jo...@manyfish.co.uk> on 2004/03/19 02:17:58 UTC

nasty apr-util xlate bug

I was investigating one AIX problem and came across a different one:

http://www.contactor.se/~dast/svn/archive-2003-03/1565.shtml

the cause seems to be the check_sbcs function in xlate.c.

The convset->ich is set up to convert from UTF-8 to ISO-8895-1;
check_sbcs fails to create the sbcs table for this case, of course.  But
the iconv state is not reset, and it was by design left part-way through
a UTF-8 sequence: so when a subsequent conv_buffer calls come along to
reuse iconv->ich, expecting it to be in the initial state, it all goes
horribly wrong.

The iconv(, NULL, NULL, NULL, NULL)-type calls to reset to the initial
conversion state segfaulted on AIX so it seems necessary to recreate the
cd, as per below... the same problem exists in the apr_iconv-based
implementation I guess.

Am I missing anything? Not sure why this problem doesn't affect more
users.

--- xlate/xlate.c	13 Feb 2004 09:55:27 -0000	1.19
+++ xlate/xlate.c	19 Mar 2004 01:06:48 -0000
@@ -132,6 +132,11 @@
         convset->ich = (iconv_t)-1;
 
         /* TODO: add the table to the cache */
+    } else {
+        /* reopen the iconv cd to get back to initial conversion
+         * state */
+        iconv_close(convset->ich);
+        convset->ich = iconv_open(convset->topage, convset->frompage);
     }
 }
 #elif APU_HAVE_APR_ICONV

Re: nasty apr-util xlate bug

Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:

> Can APR just hard-code a couple of EBCDIC<->something translation tables
> (#ifdef I_AM_EBCDIC) to solve these problems?  I'd be suprised if the
> iconv() adds that much overhead vs the translation table, if you're
> going to do the iconv_open() anyway.

I won't worry about that for now; maybe Apache (or some other app that 
frequently does converstion of protocol data) needs to do something different 
if it turns out to be too slow

> Anyway, you don't disagree that the current code "single-byte
> translation detection" code is unsafe and needs to be removed?

delete the internal detection/use of sbcs table at will, but I'm not ready to 
give up on the convert-single-byte API, which is quite handly (a simplifying 
factor for the app) in some circumstances

Re: nasty apr-util xlate bug

Posted by Joe Orton <jo...@manyfish.co.uk>.
On Mon, Mar 22, 2004 at 09:45:20AM -0500, Jeff Trawick wrote:
> Joe Orton wrote:
> 
> >So it seems no optimisation is really possible here; what performance
> >problem is this code solving?  The overhead of iconv surely in
> >iconv_open() anyway: I think the right thing to do is remove the
> >check_sbcs() functions. 
> 
> On z/OS, where charset translation is a key feature of most network apps, 
> the applications I'm familar with used their own tables for fast lookup 
> when that was safe, and used iconv() otherwise.  I did not do my own 
> performance analysis to see what the real benefits are, but simply assumed 
> this was valuable for some real world scenarios where APR is used, such as 
> Apache on z/OS where lots of data will be translated.  So I put the local 
> table capability down in APR so the app wouldn't have to worry about it.  
> But that has an issue:  APR really doesn't know one charset from another, 
> unlike many apps which constantly deal with 
> some-EBCDIC-codepage<->US-ASCII, so APR doesn't inherently know when a 
> translation is always single-byte and needs to figure it out if it is going 
> to own the optimization task.

Can APR just hard-code a couple of EBCDIC<->something translation tables
(#ifdef I_AM_EBCDIC) to solve these problems?  I'd be suprised if the
iconv() adds that much overhead vs the translation table, if you're
going to do the iconv_open() anyway.

Anyway, you don't disagree that the current code "single-byte
translation detection" code is unsafe and needs to be removed?

> There is also an API simplicity issue.  In some code situations it is nice 
> to have an API that is as simple as an array access (see 
> apr_xlate_conv_byte()), but this makes sense only when it is known that the 
> translation is single-byte. Apache can use this in some common situations 
>  where it is not too much overhead, but it would probably want to maintain 
> its own tables for protocol data conversion in lieu of having to set up 
> calls to the more complicated apr_xlate_conv_buffer() all the time.

Re: nasty apr-util xlate bug

Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:

> So it seems no optimisation is really possible here; what performance
> problem is this code solving?  The overhead of iconv surely in
> iconv_open() anyway: I think the right thing to do is remove the
> check_sbcs() functions. 

On z/OS, where charset translation is a key feature of most network apps, the 
applications I'm familar with used their own tables for fast lookup when that 
was safe, and used iconv() otherwise.  I did not do my own performance analysis 
to see what the real benefits are, but simply assumed this was valuable for 
some real world scenarios where APR is used, such as Apache on z/OS where lots 
of data will be translated.  So I put the local table capability down in APR so 
the app wouldn't have to worry about it.  But that has an issue:  APR really 
doesn't know one charset from another, unlike many apps which constantly deal 
with some-EBCDIC-codepage<->US-ASCII, so APR doesn't inherently know when a 
translation is always single-byte and needs to figure it out if it is going to 
own the optimization task.

There is also an API simplicity issue.  In some code situations it is nice to 
have an API that is as simple as an array access (see apr_xlate_conv_byte()), 
but this makes sense only when it is known that the translation is single-byte. 
  Apache can use this in some common situations where it is not too much 
overhead, but it would probably want to maintain its own tables for protocol 
data conversion in lieu of having to set up calls to the more complicated 
apr_xlate_conv_buffer() all the time.

Re: nasty apr-util xlate bug

Posted by Joe Orton <jo...@manyfish.co.uk>.
On Fri, Mar 19, 2004 at 05:08:00PM -0500, Jeff Trawick wrote:
> Joe Orton wrote:
> >Hmm.  Is this sbcs thing really safe at all?  Just because a character
> >set translation gives a particular mapping for 0x00-0xff in that order
> >why is it guaranteed that it will for any other ordering of bytes? 
> >
> >e.g. invent a mapping which does "0xff <end>" -> "0xff" but "0xff 0xf1"
> >-> "0x42".  I'd be surprised if a mapping between two real charsets does
> >*not* exist which does something like this, given the range of extremely
> >weird and wonderful charsets out there.
> 
> to rule out this issue completely, 256 calls to iconv() would be required 
> in check_sbcs() to test each proposed byte/char individually ;)

My example is a bit bogus since iconv presumably can't handle a charset
where "0xff" and "0xff 0xf1" represent different characters, but the
point is there.  Yes, logically, it would be necessary to do 256
individual iconv() calls to accurately deduce a single-byte lookup table
for a particular mapping if one is available.

So it seems no optimisation is really possible here; what performance
problem is this code solving?  The overhead of iconv surely in
iconv_open() anyway: I think the right thing to do is remove the
check_sbcs() functions. 

joe

Re: nasty apr-util xlate bug

Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:
> Hmm.  Is this sbcs thing really safe at all?  Just because a character
> set translation gives a particular mapping for 0x00-0xff in that order
> why is it guaranteed that it will for any other ordering of bytes? 
> 
> e.g. invent a mapping which does "0xff <end>" -> "0xff" but "0xff 0xf1"
> -> "0x42".  I'd be surprised if a mapping between two real charsets does
> *not* exist which does something like this, given the range of extremely
> weird and wonderful charsets out there.

to rule out this issue completely, 256 calls to iconv() would be required in 
check_sbcs() to test each proposed byte/char individually ;)

or maybe do two tests...  one with the byte values incrementing and one with 
the byte values decrementing...  that would seem to drastically reduce the chances

not nice either way

how else to tell that this is simple table?

Re: nasty apr-util xlate bug

Posted by Joe Orton <jo...@manyfish.co.uk>.
Hmm.  Is this sbcs thing really safe at all?  Just because a character
set translation gives a particular mapping for 0x00-0xff in that order
why is it guaranteed that it will for any other ordering of bytes? 

e.g. invent a mapping which does "0xff <end>" -> "0xff" but "0xff 0xf1"
-> "0x42".  I'd be surprised if a mapping between two real charsets does
*not* exist which does something like this, given the range of extremely
weird and wonderful charsets out there.


Re: nasty apr-util xlate bug

Posted by Jeff Trawick <tr...@attglobal.net>.
Joe Orton wrote:

> The convset->ich is set up to convert from UTF-8 to ISO-8895-1;
> check_sbcs fails to create the sbcs table for this case, of course.  But
> the iconv state is not reset, and it was by design left part-way through
> a UTF-8 sequence: so when a subsequent conv_buffer calls come along to
> reuse iconv->ich, expecting it to be in the initial state, it all goes
> horribly wrong.

make sense, says the perpetrator

> The iconv(, NULL, NULL, NULL, NULL)-type calls to reset to the initial
> conversion state segfaulted on AIX so it seems necessary to recreate the
> cd, as per below... the same problem exists in the apr_iconv-based
> implementation I guess.

shrug

> Am I missing anything? Not sure why this problem doesn't affect more
> users.

ditto...  while most of my use of it has been with the simple conversions, I 
remember playing with mod_charset_lite and this code on Solaris 8 with 
multi-byte characters