Posted to dev@apr.apache.org by Luke Kenneth Casson Leighton <lk...@samba-tng.org> on 2001/06/12 17:22:25 UTC

apr unicode-16 lib.

for various reasons i am prompted to ask,

how would the idea of having an apr_ucs16 set of routines,
apr_wstrcat, apr_wstrcpy, apr_wtolower, apr_wtoupper etc.,
be received?

on nt, it's easy: straightforward usage of the NT 
wstrcat, wstrcpy etc. lines.

on unix, it's slightly more tricky, but easily doable.
[and example code exists in samba, anyway:
they've tried it there, but never yet completed it
satisfactorily]

iirc, glib has a unicode library, however it is ucs32 not
ucs16, and depends on glib, which is an N-mbytes install,
and not what i need, iow.

how about it? :)

luke

Re: apr unicode-16 lib.

Posted by "William A. Rowe, Jr." <ad...@rowe-clan.net>.
From: "Greg Stein" <gs...@lyra.org>
Sent: Wednesday, June 13, 2001 3:01 PM


> On Wed, Jun 13, 2001 at 05:35:01PM +0200, Luke Kenneth Casson Leighton wrote:
> > i don't mind.  as long as there's something that can be used
> > as the basis to write an APR-based SMB server, and it's capable
> > of handling ucs2 in intel-native format off-the-wire.
> 
> The apr_iconv stuff should be able to do UCS-2 -> UTF-8 conversion. If it
> can't, then it is useless :-)
> 
> [ I'm guessing it already can; in any case, the API is there for this ]

Yup...

> [ hmm. apr/include/arch/unix/i18n.h has some conversion code; no idea why ]

It's a secondary implementation, could be grabbed by apr_xlate if the implementation
is 'less than secure', and was primarily for Win32 since we didn't have xlate
conversion semantics (xlate lets us convert 'just enough' bytes, WinNT expects to
have a large enough buffer for all or nothing.)

Bill



Re: apr unicode-16 lib.

Posted by Greg Stein <gs...@lyra.org>.
On Wed, Jun 13, 2001 at 05:35:01PM +0200, Luke Kenneth Casson Leighton wrote:
> On Wed, Jun 13, 2001 at 09:57:41AM -0500, William A. Rowe, Jr. wrote:
> > Then let's not start adding things willy nilly.  We have apr_iconv due to
> > portability, let's build upon that.  It should be across character sets, so
> > we can handle this stuff in an opaque manner.

Agreed!

>...
> i don't mind.  as long as there's something that can be used
> as the basis to write an APR-based SMB server, and it's capable
> of handling ucs2 in intel-native format off-the-wire.

The apr_iconv stuff should be able to do UCS-2 -> UTF-8 conversion. If it
can't, then it is useless :-)

[ I'm guessing it already can; in any case, the API is there for this ]

[ hmm. apr/include/arch/unix/i18n.h has some conversion code; no idea why ]

> [i can auto-generate some code to do the conversion, it
> doesn't matter what the internal format is in, ultimately, as
> long as no information is lost,

UTF-8 is a lossless encoding of UCS-2 (or UCS-4).

> and there's a secondary
> consideration to speed.  samba is full of code that
> converts ucs2 to ascii by dropping the high byte.]

That is way broken :-)
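
To make the difference concrete, here is a minimal sketch in plain C of converting little-endian UCS-2 off the wire into UTF-8, next to the drop-the-high-byte shortcut. All the `sketch_` names are hypothetical and are neither APR nor Samba functions; the sketch assumes BMP-only input and does not treat surrogate code units specially.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an APR or Samba function): encode one UCS-2
 * code unit as UTF-8 and return the number of bytes written (1 to 3).
 * Assumption: surrogate code units are not handled specially. */
static size_t sketch_ucs2_to_utf8(uint16_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Convert a little-endian ("intel-native") UCS-2 buffer to UTF-8.
 * Returns the number of bytes written, or (size_t)-1 if in_len is odd
 * or the output buffer is too small. */
size_t sketch_ucs2le_buf_to_utf8(const unsigned char *in, size_t in_len,
                                 unsigned char *out, size_t out_cap)
{
    size_t i, j, n = 0;

    if (in_len % 2 != 0)
        return (size_t)-1;
    for (i = 0; i < in_len; i += 2) {
        unsigned char tmp[3];
        uint16_t cp = (uint16_t)(in[i] | (in[i + 1] << 8));
        size_t k = sketch_ucs2_to_utf8(cp, tmp);

        if (n + k > out_cap)
            return (size_t)-1;
        for (j = 0; j < k; j++)
            out[n++] = tmp[j];
    }
    return n;
}

/* The lossy shortcut: keep only the low byte of each code unit. */
unsigned char sketch_drop_high_byte(uint16_t cp)
{
    return (unsigned char)(cp & 0xFF);
}
```

With this sketch, U+20AC and U+00AC both collapse to the byte 0xAC under the shortcut, while the UTF-8 path keeps them distinct (E2 82 AC versus C2 AC).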

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: apr unicode-16 lib.

Posted by jean-frederic clere <jf...@fujitsu-siemens.com>.
"William A. Rowe, Jr." wrote:
> 
> From: "jean-frederic clere" <jf...@fujitsu-siemens.com>
> Sent: Wednesday, June 13, 2001 11:48 AM
> 
> > Luke Kenneth Casson Leighton wrote:
> > >
> > > On Wed, Jun 13, 2001 at 09:57:41AM -0500, William A. Rowe, Jr. wrote:
> > >
> > > > Then let's not start adding things willy nilly.  We have apr_iconv due to
> > > > portability, let's build upon that.  It should be across character sets, so
> > > > we can handle this stuff in an opaque manner.
> > >
> > > ack.
> >
> > Great!
> >
> > By the way I am not 100% happy with apr-iconv, I am thinking of returning status
> > instead of the actual size_t. Something like:
> >
> > apr_status_t apr_iconv(iconv_t cd, const char **inbuf, size_t *inbytesleft,
> >         char **outbuf, size_t *outbytesleft, size_t *num_change)
> >
> > Any comments? - If no one complains before tomorrow I will start doing these
> > changes - (Well I have already started).
> 
> If we can fail for more than a single reason, then yes, that's an apr_status_t.
> 
> Commit away :-)  We ought to start looking at compatibility to apr_xlate.c as well,
> perhaps by tying to one module where we use the iconv package, and to another module
> where we use apr_iconv.  Haven't given that enough thought yet.

I will commit tomorrow - Because what I have is not finished -
I am using apr_set_os_error() and apr_get_os_error() to get the status at the
place I want. That is too ugly to be committed!

Re: apr unicode-16 lib.

Posted by "William A. Rowe, Jr." <wr...@rowe-clan.net>.
From: "jean-frederic clere" <jf...@fujitsu-siemens.com>
Sent: Wednesday, June 13, 2001 11:48 AM


> Luke Kenneth Casson Leighton wrote:
> > 
> > On Wed, Jun 13, 2001 at 09:57:41AM -0500, William A. Rowe, Jr. wrote:
> > 
> > > Then let's not start adding things willy nilly.  We have apr_iconv due to
> > > portability, let's build upon that.  It should be across character sets, so
> > > we can handle this stuff in an opaque manner.
> > 
> > ack.
> 
> Great!
> 
> By the way I am not 100% happy with apr-iconv, I am thinking of returning status
> instead of the actual size_t. Something like:
> 
> apr_status_t apr_iconv(iconv_t cd, const char **inbuf, size_t *inbytesleft,
>         char **outbuf, size_t *outbytesleft, size_t *num_change)
> 
> Any comments? - If no one complains before tomorrow I will start doing these
> changes - (Well I have already started).                                    

If we can fail for more than a single reason, then yes, that's an apr_status_t.

Commit away :-)  We ought to start looking at compatibility to apr_xlate.c as well,
perhaps by tying to one module where we use the iconv package, and to another module
where we use apr_iconv.  Haven't given that enough thought yet.





Re: apr unicode-16 lib.

Posted by jean-frederic clere <jf...@fujitsu-siemens.com>.
Luke Kenneth Casson Leighton wrote:
> 
> On Wed, Jun 13, 2001 at 09:57:41AM -0500, William A. Rowe, Jr. wrote:
> 
> > Then let's not start adding things willy nilly.  We have apr_iconv due to
> > portability, let's build upon that.  It should be across character sets, so
> > we can handle this stuff in an opaque manner.
> 
> ack.

Great!

By the way I am not 100% happy with apr-iconv, I am thinking of returning status
instead of the actual size_t. Something like:

apr_status_t apr_iconv(iconv_t cd, const char **inbuf, size_t *inbytesleft,
        char **outbuf, size_t *outbytesleft, size_t *num_change)

Any comments? - If no one complains before tomorrow I will start doing these
changes - (Well I have already started).                                    
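
For illustration only, the status-returning shape can be sketched as a thin wrapper over POSIX iconv(). This is not the apr-iconv code: `sketch_iconv_status` is a hypothetical name, a plain int holding an errno value stands in for apr_status_t, and `inbuf` is a non-const `char **` to match the POSIX prototype.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper, not the apr-iconv API: returns 0 on success or
 * an errno value on failure, and reports the count that iconv() would
 * have returned through *num_change. */
int sketch_iconv_status(iconv_t cd, char **inbuf, size_t *inbytesleft,
                        char **outbuf, size_t *outbytesleft,
                        size_t *num_change)
{
    size_t r = iconv(cd, inbuf, inbytesleft, outbuf, outbytesleft);

    if (r == (size_t)-1) {
        *num_change = 0;
        return errno;   /* POSIX lists E2BIG, EILSEQ and EINVAL */
    }
    *num_change = r;
    return 0;
}
```

The caller then tests the return value alone, and on E2BIG the in/out pointers and byte counts still reflect whatever was converted before the output buffer filled up.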



> 
> i don't mind.  as long as there's something that can be used
> as the basis to write an APR-based SMB server, and it's capable
> of handling ucs2 in intel-native format off-the-wire.
> 
> [i can auto-generate some code to do the conversion, it
> doesn't matter what the internal format is in, ultimately, as
> long as no information is lost, and there's a secondary
> consideration to speed.  samba is full of code that
> converts ucs2 to ascii by dropping the high byte.]
> 
> that's the driving factor, here.
> 
> lukes

Re: apr unicode-16 lib.

Posted by Luke Kenneth Casson Leighton <lk...@samba-tng.org>.
On Wed, Jun 13, 2001 at 09:57:41AM -0500, William A. Rowe, Jr. wrote:

> Then let's not start adding things willy nilly.  We have apr_iconv due to
> portability, let's build upon that.  It should be across character sets, so
> we can handle this stuff in an opaque manner.

ack.

i don't mind.  as long as there's something that can be used
as the basis to write an APR-based SMB server, and it's capable
of handling ucs2 in intel-native format off-the-wire.

[i can auto-generate some code to do the conversion, it
doesn't matter what the internal format is in, ultimately, as
long as no information is lost, and there's a secondary
consideration to speed.  samba is full of code that
converts ucs2 to ascii by dropping the high byte.]

that's the driving factor, here.

lukes


Re: apr unicode-16 lib.

Posted by "William A. Rowe, Jr." <ad...@rowe-clan.net>.
From: "Luke Kenneth Casson Leighton" <lk...@samba-tng.org>
Sent: Wednesday, June 13, 2001 7:17 AM


> On Tue, Jun 12, 2001 at 11:46:30AM -0500, William A. Rowe, Jr. wrote:
> > From: "Luke Kenneth Casson Leighton" <lk...@samba-tng.org>
> > Sent: Tuesday, June 12, 2001 10:22 AM
> > 
> > > how would the idea of having an apr_ucs16 set of routines,
> > > apr_wstrcat, apr_wstrcpy, apr_wtolower, apr_wtoupper etc.,
> > > be received?
> > 
> > Well, since apr_isfoo apr_tofoo was 'reinvented', I don't see a
> > huge problem.
>  
> cool.

But please take a look first at the dialog that's started under iconv,
this is a one way ticket to solving one specific problem.  If we implement
under apr_iconv, we can accomplish a lot more.  mod_autoindex could get
exactly 20 characters of description, even when these are 20 bytes, 33
bytes or 40 bytes.
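
As a rough sketch of the "exactly 20 characters" idea, here is the UTF-8-only case in plain C. The helper name `sketch_utf8_prefix_bytes` is hypothetical, it assumes valid NUL-terminated UTF-8, and it is not the charset-opaque version that would sit on top of apr_iconv.

```c
#include <stddef.h>

/* Hypothetical helper: return how many bytes the first max_chars
 * characters of a NUL-terminated UTF-8 string occupy.
 * Assumption: the input is valid UTF-8. */
size_t sketch_utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t bytes = 0, chars = 0;

    while (s[bytes] != '\0') {
        /* Any byte that is not a continuation byte (10xxxxxx)
         * starts a new character. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (chars == max_chars)
                break;
            chars++;
        }
        bytes++;
    }
    return bytes;
}
```

A caller wanting 20 characters of description would copy `sketch_utf8_prefix_bytes(desc, 20)` bytes, whatever that byte count turns out to be.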

> > > on nt, it's easy: straightforward usage of the NT 
> > > wstrcat, wstrcpy etc. lines.
> > 
> > These are the folks who never read the "Security Implications" of utf-8 
> > leaving 40% of all IIS webservers still vulnerable, so I'm dubious :-)
>  
> *grin*.
> 
> btw, samba #defines strcpy to ERROR_USE_SAFE_STRCPY_INSTEAD etc.
> 
> sorry, forgot about this.  okay, rewrite that: how
> about an equivalent apr_pwstrcat, apr_pwstrcpy with all
> the safety / security / paranoia therein?

Again, this is why we shouldn't simply 'do' a Unicode wrapper that is inferior.

> > Well, how about a simple question.  Why restrain ourselves to ucs2?
> 
> because it's what NT has: NT doesn't have 32-bit (ucs4?) unicode, afaik, 
> only 16-bit (ucs2?)

Ok, NT uses 16 bit unicode, later 2000 releases add the double-word pairs.

But why are you exposing for WinNT?  Here's the kick, apr is a byte oriented
interface to the OS.  It will never be otherwise.

When I say byte oriented, I mean any internationalization needs to use 
something simple and transparent, such as utf-8.  That's what we are doing,
right now.  If you want to extend unicode treatment internally as accessors
(which I did with the fast and safe utf8/ucs2 conversion) then I'm all for it,
if it helps us.  But those are internals.

The rest of the world is still byte oriented.  This is a compatibility layer,
so we need to focus apr in that direction.  

> > Can iconv/apr_iconv provide this in a charset-opaque manner?  That is, if
> > I want three 'characters' in shift-jis, can it give me the right number
> > of bytes?  The reason is simple, Unicode is already splintered into a
> > multi-word character set anyways.  I suspect it's easier to just get it
> > right, knowing the apr_xlate that's been opened, and asking for the char
> > len v.s. the byte len (sizeof) and providing the strcpy/cmp, etc.
> 
> you need to be able to wtoupper, wtolower etc.  that requires
> a lookup table.  samba has an optimised lookup table of the
> standard ucs2 upper/lower conversion tables that is small enough
> to fit into the 2nd-level cache of an intel processor.

Then let's not start adding things willy nilly.  We have apr_iconv due to
portability, let's build upon that.  It should be across character sets, so
we can handle this stuff in an opaque manner.

Bill




Re: apr unicode-16 lib.

Posted by Luke Kenneth Casson Leighton <lk...@samba-tng.org>.
On Tue, Jun 12, 2001 at 11:46:30AM -0500, William A. Rowe, Jr. wrote:
> From: "Luke Kenneth Casson Leighton" <lk...@samba-tng.org>
> Sent: Tuesday, June 12, 2001 10:22 AM
> 
> 
> > for various reasons i am prompted to ask,
> > 
> > how would the idea of having an apr_ucs16 set of routines,
> > apr_wstrcat, apr_wstrcpy, apr_wtolower, apr_wtoupper etc.,
> > be received?
> 
> Well, since apr_isfoo apr_tofoo was 'reinvented', I don't see a
> huge problem.
 
cool.

> > on nt, it's easy: straightforward usage of the NT 
> > wstrcat, wstrcpy etc. lines.
> 
> These are the folks who never read the "Security Implications" of utf-8 
> leaving 40% of all IIS webservers still vulnerable, so I'm dubious :-)
 
*grin*.

btw, samba #defines strcpy to ERROR_USE_SAFE_STRCPY_INSTEAD
etc.

sorry, forgot about this.  okay, rewrite that: how
about an equivalent apr_pwstrcat, apr_pwstrcpy with all
the safety / security / paranoia therein?

> Well, how about a simple question.  Why restrain ourselves to ucs2?

because it's what NT has: NT doesn't have 32-bit (ucs4?) unicode, afaik, 
only 16-bit (ucs2?)

writing your own ucs4 library, forget it, might as well adopt the
glib one.  but iirc, the glib one _only_ does ucs4, not ucs2.


> (No such thing as ucs16/32, it's ucs2/4).
> 

ack.

> Can iconv/apr_iconv provide this in a charset-opaque manner?  That is, if
> I want three 'characters' in shift-jis, can it give me the right number
> of bytes?  The reason is simple, Unicode is already splintered into a
> multi-word character set anyways.  I suspect it's easier to just get it
> right, knowing the apr_xlate that's been opened, and asking for the char
> len v.s. the byte len (sizeof) and providing the strcpy/cmp, etc.

you need to be able to wtoupper, wtolower etc.  that requires
a lookup table.  samba has an optimised lookup table of the
standard ucs2 upper/lower conversion tables that is small enough
to fit into the 2nd-level cache of an intel processor.
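
A minimal sketch of table-driven upper-casing for UCS-2 code units follows. It is not Samba's table: the `sketch_` names are hypothetical, and the handful of ranges is an illustrative subset rather than the full Unicode case mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of lowercase code units and the offset that maps
 * each of them to its uppercase form. */
struct sketch_case_range {
    uint16_t first;
    uint16_t last;
    int32_t  delta;
};

/* Illustrative subset only: not Samba's table and not a complete
 * Unicode case mapping. */
static const struct sketch_case_range sketch_upper_ranges[] = {
    { 0x0061, 0x007A, -32 },  /* ASCII a..z */
    { 0x00E0, 0x00F6, -32 },  /* Latin-1, U+00E0..U+00F6 */
    { 0x00F8, 0x00FE, -32 },  /* Latin-1, U+00F8..U+00FE */
    { 0x03B1, 0x03C1, -32 },  /* Greek alpha..rho */
    { 0x0430, 0x044F, -32 },  /* Cyrillic U+0430..U+044F */
};

/* Hypothetical wtoupper: map one UCS-2 code unit to upper case, leaving
 * anything outside the table unchanged. */
uint16_t sketch_wtoupper(uint16_t c)
{
    size_t i;
    size_t n = sizeof sketch_upper_ranges / sizeof sketch_upper_ranges[0];

    for (i = 0; i < n; i++) {
        if (c >= sketch_upper_ranges[i].first &&
            c <= sketch_upper_ranges[i].last)
            return (uint16_t)(c + sketch_upper_ranges[i].delta);
    }
    return c;
}
```

A range table like this trades lookup speed for size; a flat array indexed by code unit would be faster but larger.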

luke

Re: apr unicode-16 lib.

Posted by jean-frederic clere <jf...@fujitsu-siemens.com>.
"William A. Rowe, Jr." wrote:
> 
> From: "Luke Kenneth Casson Leighton" <lk...@samba-tng.org>
> Sent: Tuesday, June 12, 2001 10:22 AM
> 
> > for various reasons i am prompted to ask,
> >
> > how would the idea of having an apr_ucs16 set of routines,
> > apr_wstrcat, apr_wstrcpy, apr_wtolower, apr_wtoupper etc.,
> > be received?
> 
> Well, since apr_isfoo apr_tofoo was 'reinvented', I don't see a
> huge problem.
> 
> > on nt, it's easy: straightforward usage of the NT
> > wstrcat, wstrcpy etc. lines.
> 
> These are the folks who never read the "Security Implications" of utf-8
> leaving 40% of all IIS webservers still vulnerable, so I'm dubious :-)
> 
> > on unix, it's slightly more tricky, but easily doable.
> > [and example code exists in samba, anyway:
> > they've tried it there, but never yet completed it
> > satisfactorily]
> >
> > iirc, glib has a unicode library, however it is ucs32 not
> > ucs16, and depends on glib, which is an N-mbytes install,
> > and not what i need, iow.
> >
> > how about it? :)
> 
> Well, how about a simple question.  Why restrain ourselves to ucs2?
> (No such thing as ucs16/32, it's ucs2/4).
> 
> Can iconv/apr_iconv provide this in a charset-opaque manner?

Well I have also started to think that apr-iconv could be a place for these
things...
We have there the size information.
The code could be:
apr_iconv_open("shift-jis","shift-jis",ctx);
apr_iconv_wstrcat(out,in);
...

>
>  That is, if
> I want three 'characters' in shift-jis, can it give me the right number
> of bytes?  The reason is simple, Unicode is already splintered into a
> multi-word character set anyways.  I suspect it's easier to just get it
> right, knowing the apr_xlate that's been opened, and asking for the char
> len v.s. the byte len (sizeof) and providing the strcpy/cmp, etc.
> 
> Bill

Re: apr unicode-16 lib.

Posted by "William A. Rowe, Jr." <ad...@rowe-clan.net>.
From: "Luke Kenneth Casson Leighton" <lk...@samba-tng.org>
Sent: Tuesday, June 12, 2001 10:22 AM


> for various reasons i am prompted to ask,
> 
> how would the idea of having an apr_ucs16 set of routines,
> apr_wstrcat, apr_wstrcpy, apr_wtolower, apr_wtoupper etc.,
> be received?

Well, since apr_isfoo apr_tofoo was 'reinvented', I don't see a
huge problem.

> on nt, it's easy: straightforward usage of the NT 
> wstrcat, wstrcpy etc. lines.

These are the folks who never read the "Security Implications" of utf-8 
leaving 40% of all IIS webservers still vulnerable, so I'm dubious :-)

> on unix, it's slightly more tricky, but easily doable.
> [and example code exists in samba, anyway:
> they've tried it there, but never yet completed it
> satisfactorily]
>
> iirc, glib has a unicode library, however it is ucs32 not
> ucs16, and depends on glib, which is an N-mbytes install,
> and not what i need, iow.
> 
> how about it? :)

Well, how about a simple question.  Why restrain ourselves to ucs2?
(No such thing as ucs16/32, it's ucs2/4).

Can iconv/apr_iconv provide this in a charset-opaque manner?  That is, if
I want three 'characters' in shift-jis, can it give me the right number
of bytes?  The reason is simple, Unicode is already splintered into a
multi-word character set anyways.  I suspect it's easier to just get it
right, knowing the apr_xlate that's been opened, and asking for the char
len v.s. the byte len (sizeof) and providing the strcpy/cmp, etc.

Bill