Posted to apreq-dev@httpd.apache.org by Stas Bekman <st...@stason.org> on 2005/03/17 02:56:31 UTC

unicode (was Re: today's code review..)

Joe Schaefer wrote:
> Joe Schaefer <jo...@sunstarsys.com> writes:
> 
> [...]
> 
> 
>>But beyond that, I'm not sure what we can do right now in apreq.
>>The %uXXXX escapes are rare enough that I don't think we should
>>adjust our decode-APIs around them.
> 
> 
> FWIW, here's some food for thought:
> 
>       http://intertwingly.net/stories/2004/04/14/i18n.html
> 
> I have some rough ideas about how to incorporate this into the
> url-decoding stuff, but nothing solid at the moment.  IMO the
> XForms and WHATWG specs are essentially useless right now, 
> so I suggest we pursue a simple divination strategy
> based on this document:
> 
>       all %-encodings are 7-bit ->  mark as APREQ_CHARSET_ASCII, 
>       some %-encoding is 8-bit -> divine the encoding, mark result
>                                   as either iso-8859-1 or utf8.
> 

What do you mean, Joe? To automatically convert any input to a predefined 
format? Or do you mean something else?

-- 
__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com

Re: unicode

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Max Kellermann <ma...@duempel.org> writes:

[...]

>> 4) expose a utility function which converts cp-1252 strings
>> to utf8.
>
> We should also have a flag "always UTF-8, please", which makes
> libapreq2 transparently convert everything to UTF-8.

We have that already: it's either called apreq_register_parser() 
or apreq_parser_set() :-).

-- 
Joe Schaefer

Re: unicode

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Max Kellermann <ma...@duempel.org> writes:

> On 2005/03/17 17:09, Joe Schaefer <jo...@sunstarsys.com> wrote:
>> 1) in apreq.h, add
>
> +1
>
>> 2) in apreq_param.h: replace the "utf8" stuff with
>
> +1
>
>> 3) upgrade apreq_param_decode() and apreq_param_decodev()
>> to report the charset detected (probably via the return value
>> so people can't just ignore it).  The divination logic would 
>> go like this:
>
> Your algorithm is probably the best we can do.. if that works out
> for all important clients, +1 from me.

Ok, the basics are in the branch now.  It's based largely on
Sam Ruby's page, so please review the code carefully.  I'm
not sure we've got the latin1/cp1252 stuff right yet.

-- 
Joe Schaefer

Re: unicode

Posted by Max Kellermann <ma...@duempel.org>.

On 2005/03/17 17:09, Joe Schaefer <jo...@sunstarsys.com> wrote:
> 1) in apreq.h, add

+1

> 2) in apreq_param.h: replace the "utf8" stuff with

+1

> 3) upgrade apreq_param_decode() and apreq_param_decodev()
> to report the charset detected (probably via the return value
> so people can't just ignore it).  The divination logic would 
> go like this:

Your algorithm is probably the best we can do.. if that works out for
all important clients, +1 from me.

> 4) expose a utility function which converts cp-1252 strings
> to utf8.

We should also have a flag "always UTF-8, please", which makes
libapreq2 transparently convert everything to UTF-8.

> 5) Replace the perl-glue's $param->is_utf8() method with charset().
> When we have to expose a cp-1252 encoded param to a perl user, 
> we use the utility function from (4) and translate the data to utf8 
> (in the SvPV; we don't modify the apreq_param_t at all).

0, because I haven't worked with the perl glue yet. But it sounds ok.

Max

Re: unicode

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Joe Schaefer <jo...@sunstarsys.com> writes:

[...]

>  I'm not sure how we should handle this, but two options seem obvious:
> translate that to utf8,  or add windows-1252 to our charset list.

At the moment, I prefer the latter.  Here's what I think will 
work best for apreq; please critique.

Thanks!

==================================================

1) in apreq.h, add

/** Character encodings. */
typedef enum {
    APREQ_CHARSET_ASCII  =0,
    APREQ_CHARSET_LATIN1 =1, /* ISO-8859-1   */
    APREQ_CHARSET_CP_1252=2, /* Windows-1252 */
    APREQ_CHARSET_UTF8   =8
} apreq_charset_t;

==================================================

2) in apreq_param.h: replace the "utf8" stuff with

/** Sets the character encoding for this parameter; returns the old one. */
static APR_INLINE
apreq_charset_t apreq_param_charset_set(apreq_param_t *p, unsigned char c) {
    /* Read the previous charset before overwriting it. */
    unsigned char old = APREQ_FLAGS_GET(p->flags, APREQ_CHARSET);
    APREQ_FLAGS_SET(p->flags, APREQ_CHARSET, c);
    return old;
}

/** Gets the character encoding for this parameter. */
static APR_INLINE
apreq_charset_t apreq_param_charset_get(apreq_param_t *p) {
    return APREQ_FLAGS_GET(p->flags, APREQ_CHARSET);
}
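
To make the flag packing behind these accessors concrete, here is a minimal,
self-contained sketch. Everything prefixed SKETCH_/sketch_ is hypothetical:
the bit position, the mask width, the macros, and the struct are assumptions
made for illustration and are not taken from the apreq headers. The setter
is written to hand back the previous charset.

```c
/* Hypothetical layout: the charset occupies the low 8 bits of the flags. */
#define SKETCH_CHARSET_BIT   0
#define SKETCH_CHARSET_MASK  0xFFu

/* Extract the field called `name` from the flags word `f`. */
#define SKETCH_FLAGS_GET(f, name) (((f) >> name##_BIT) & name##_MASK)

/* Overwrite the field called `name` in `f`, leaving other bits alone. */
#define SKETCH_FLAGS_SET(f, name, v)                      \
    ((f) = ((f) & ~(name##_MASK << name##_BIT))           \
         | ((name##_MASK & (v)) << name##_BIT))

/* Stand-in for a parameter struct; only the flags word matters here. */
typedef struct {
    unsigned flags;
} sketch_param_t;

/* Store charset c in the flags and return the previously stored value. */
unsigned sketch_charset_set(sketch_param_t *p, unsigned c)
{
    unsigned old = SKETCH_FLAGS_GET(p->flags, SKETCH_CHARSET);
    SKETCH_FLAGS_SET(p->flags, SKETCH_CHARSET, c);
    return old;
}

/* Read the charset currently stored in the flags. */
unsigned sketch_charset_get(const sketch_param_t *p)
{
    return SKETCH_FLAGS_GET(p->flags, SKETCH_CHARSET);
}
```

Returning the old value lets a caller switch the charset temporarily and
restore it afterwards without a separate get call.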

==================================================

3) upgrade apreq_param_decode() and apreq_param_decodev()
to report the charset detected (probably via the return value
so people can't just ignore it).  The divination logic would 
go like this:

      a) Presume the charset is 7-bit ASCII;  
         if that cannot possibly  be true, then

      b) Presume the data was utf8 encoded. If
         that cannot possibly be true, then

      c) Presume the data was encoded using iso-8859-1,
         unless bytes in the C1 control range (0x80 - 0x9F)
         appear.  If they do, then

      d) Mark it windows-1252.
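
The divination steps (a) through (d) can be sketched roughly as follows,
operating on bytes that have already been url-decoded. This is only an
illustration under stated assumptions: the sketch_ names are hypothetical,
the enum merely mirrors the values proposed in (1), and the UTF-8 check is
deliberately simplified (it validates lead and continuation bytes but does
not reject every overlong or surrogate sequence).

```c
#include <stddef.h>

/* Hypothetical copy of the enum from (1), renamed to avoid clashes. */
typedef enum {
    SKETCH_CHARSET_ASCII   = 0,
    SKETCH_CHARSET_LATIN1  = 1,
    SKETCH_CHARSET_CP_1252 = 2,
    SKETCH_CHARSET_UTF8    = 8
} sketch_charset_t;

/* Simplified well-formedness test: 1 if s[0..len) looks like UTF-8. */
static int sketch_is_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;                        /* continuation bytes expected */
        if (c < 0x80) { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) n = 1;
        else if (c >= 0xE0 && c <= 0xEF) n = 2;
        else if (c >= 0xF0 && c <= 0xF4) n = 3;
        else return 0;                   /* not a valid lead byte */
        if (len - i <= n) return 0;      /* sequence is truncated */
        for (size_t k = 1; k <= n; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += n + 1;
    }
    return 1;
}

/* Apply steps (a)-(d) to an already url-decoded byte string. */
sketch_charset_t sketch_divine_charset(const unsigned char *s, size_t len)
{
    int has_high = 0;   /* any byte >= 0x80 seen        */
    int has_c1 = 0;     /* any byte in 0x80 - 0x9F seen */
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            has_high = 1;
            if (s[i] <= 0x9F) has_c1 = 1;
        }
    }
    if (!has_high) return SKETCH_CHARSET_ASCII;             /* (a) */
    if (sketch_is_utf8(s, len)) return SKETCH_CHARSET_UTF8; /* (b) */
    if (!has_c1) return SKETCH_CHARSET_LATIN1;              /* (c) */
    return SKETCH_CHARSET_CP_1252;                          /* (d) */
}
```

Note that the order matters: UTF-8 continuation bytes include the range
0x80 - 0x9F, so the C1 test is only meaningful after UTF-8 has been
ruled out.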

==================================================

4) expose a utility function which converts cp-1252 strings
to utf8.
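
For reference, such a utility might look roughly like the sketch below. The
function name, signature, and buffer contract are hypothetical, not a
proposed apreq API. The table covers the 27 code points that windows-1252
defines in 0x80 - 0x9F; passing the five undefined bytes (0x81, 0x8D, 0x8F,
0x90, 0x9D) through as the matching C1 control code points is an assumption
of this sketch.

```c
#include <stddef.h>

/* Unicode code points for windows-1252 bytes 0x80 - 0x9F. */
static const unsigned short sketch_cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/*
 * Convert slen bytes of windows-1252 text to UTF-8.
 * dst must have room for 3 * slen bytes (the worst case).
 * Returns the number of bytes written; no terminator is added.
 */
size_t sketch_cp1252_to_utf8(const unsigned char *src, size_t slen,
                             unsigned char *dst)
{
    size_t o = 0;
    for (size_t i = 0; i < slen; i++) {
        unsigned int cp = src[i];
        if (cp >= 0x80 && cp <= 0x9F)
            cp = sketch_cp1252_high[cp - 0x80];
        if (cp < 0x80) {                 /* 1-byte sequence */
            dst[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {         /* 2-byte sequence */
            dst[o++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {                         /* 3-byte sequence */
            dst[o++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Bytes 0xA0 - 0xFF are identical in iso-8859-1 and windows-1252, so the same
routine would also serve for latin1 input.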

==================================================

5) Replace the perl-glue's $param->is_utf8() method with charset().
When we have to expose a cp-1252 encoded param to a perl user, 
we use the utility function from (4) and translate the data to utf8 
(in the SvPV; we don't modify the apreq_param_t at all).

-- 
Joe Schaefer

Re: unicode

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Max Kellermann <ma...@duempel.org> writes:

> On 2005/03/17 02:56, Stas Bekman <st...@stason.org> wrote:
>> >      all %-encodings are 7-bit ->  mark as APREQ_CHARSET_ASCII, 
>> >      some %-encoding is 8-bit -> divine the encoding, mark result
>> >                                  as either iso-8859-1 or utf8.
>> >
>> 
>> What do you mean, Joe? To automatically convert any input to a
>> predefined format? Or do you mean something else?
>
> No, just to mark the value as "ASCII" or "UTF-8", and to let the
> application decide how to handle it. No conversion.

Yup, but there's one snafu here: AFAICT the windows-1252 encodings 
must be mapped to utf8 (none of the 27 chars mentioned in Sam Ruby's 
survival guide translate to iso-8859-1).  I'm not sure how we should 
handle this, but two options seem obvious: translate that to utf8, 
or add windows-1252 to our charset list.

-- 
Joe Schaefer

Re: unicode (was Re: today's code review..)

Posted by Stas Bekman <st...@stason.org>.

Max Kellermann wrote:
> On 2005/03/17 02:56, Stas Bekman <st...@stason.org> wrote:
> 
>>>     all %-encodings are 7-bit ->  mark as APREQ_CHARSET_ASCII, 
>>>     some %-encoding is 8-bit -> divine the encoding, mark result
>>>                                 as either iso-8859-1 or utf8.
>>>
>>
>>What do you mean, Joe? To automatically convert any input to a predefined 
>>format? Or do you mean something else?
> 
> 
> No, just to mark the value as "ASCII" or "UTF-8", and to let the
> application decide how to handle it. No conversion.

Thanks Max, I have a long way to go to catch up with all the new changes...



Re: unicode (was Re: today's code review..)

Posted by Max Kellermann <ma...@duempel.org>.

On 2005/03/17 02:56, Stas Bekman <st...@stason.org> wrote:
> >      all %-encodings are 7-bit ->  mark as APREQ_CHARSET_ASCII, 
> >      some %-encoding is 8-bit -> divine the encoding, mark result
> >                                  as either iso-8859-1 or utf8.
> >
> 
> What do you mean, Joe? To automatically convert any input to a predefined 
> format? Or do you mean something else?

No, just to mark the value as "ASCII" or "UTF-8", and to let the
application decide how to handle it. No conversion.

Max