Posted to apreq-dev@httpd.apache.org by Stas Bekman <st...@stason.org> on 2005/03/17 02:56:31 UTC
unicode (was Re: today's code review..)
Joe Schaefer wrote:
> Joe Schaefer <jo...@sunstarsys.com> writes:
>
> [...]
>
>
>>But beyond that, I'm not sure what we can do right now in apreq.
>>The %uXXXX escapes are rare enough that I don't think we should
>>adjust our decode-APIs around them.
>
>
> FWIW, here's some food for thought:
>
> http://intertwingly.net/stories/2004/04/14/i18n.html
>
> I have some rough ideas about how to incorporate this into the
> url-decoding stuff, but nothing solid at the moment. IMO the
> XForms and WHATWG specs are essentially useless right now,
> so I suggest we pursue a simple divination strategy
> based on this document:
>
> all %-encodings are 7-bit -> mark as APREQ_CHARSET_ASCII,
> some %-encoding is 8-bit -> divine the encoding, mark result
> as either iso-8859-1 or utf8.
>
What do you mean, Joe? To automatically convert any input to a predefined
format? Or do you mean something else?
--
__________________________________________________________________
Stas Bekman JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/ mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org http://ticketmaster.com
Re: unicode
Posted by Joe Schaefer <jo...@sunstarsys.com>.
Max Kellermann <ma...@duempel.org> writes:
[...]
>> 4) expose a utility function which converts cp-1252 strings
>> to utf8.
>
> We should also have a flag "always UTF-8, please", which makes
> libapreq2 transparently convert everything to UTF-8.
We have that already: it's either called apreq_register_parser()
or apreq_parser_set() :-).
--
Joe Schaefer
Re: unicode
Posted by Joe Schaefer <jo...@sunstarsys.com>.
Max Kellermann <ma...@duempel.org> writes:
> On 2005/03/17 17:09, Joe Schaefer <jo...@sunstarsys.com> wrote:
>> 1) in apreq.h, add
>
> +1
>
>> 2) in apreq_param.h: replace the "utf8" stuff with
>
> +1
>
>> 3) upgrade apreq_param_decode() and apreq_param_decodev()
>> to report the charset detected (probably via the return value
>> so people can't just ignore it). The divination logic would
>> go like this:
>
> Your algorithm is probably the best we can do.. if that works out
> for all important clients, +1 from me.
Ok, the basics are in the branch now. It's based largely on
Sam Ruby's page, so please review the code carefully. I'm
not sure we've got the latin1/cp1252 stuff right yet.
--
Joe Schaefer
Re: unicode
Posted by Max Kellermann <ma...@duempel.org>.
On 2005/03/17 17:09, Joe Schaefer <jo...@sunstarsys.com> wrote:
> 1) in apreq.h, add
+1
> 2) in apreq_param.h: replace the "utf8" stuff with
+1
> 3) upgrade apreq_param_decode() and apreq_param_decodev()
> to report the charset detected (probably via the return value
> so people can't just ignore it). The divination logic would
> go like this:
Your algorithm is probably the best we can do.. if that works out for
all important clients, +1 from me.
> 4) expose a utility function which converts cp-1252 strings
> to utf8.
We should also have a flag "always UTF-8, please", which makes
libapreq2 transparently convert everything to UTF-8.
> 5) Replace the perl-glue's $param->is_utf8() method with charset().
> When we have to expose a cp-1252 encoded param to a perl user,
> we use the utility function from (4) and translate the data to utf8
> (in the SvPV; we don't modify the apreq_param_t at all).
0, because I haven't worked with the perl glue yet. But it sounds ok.
Max
Re: unicode
Posted by Joe Schaefer <jo...@sunstarsys.com>.
Joe Schaefer <jo...@sunstarsys.com> writes:
[...]
> I'm not sure how we should handle this, but two options seem obvious:
> translate that to utf8, or add windows-1252 to our charset list.
At the moment, I prefer the latter. Here's what I think will
work best for apreq; please critique.
Thanks!
==================================================
1) in apreq.h, add
/** Character encodings. */
typedef enum {
    APREQ_CHARSET_ASCII   = 0,
    APREQ_CHARSET_LATIN1  = 1, /* ISO-8859-1   */
    APREQ_CHARSET_CP_1252 = 2, /* Windows-1252 */
    APREQ_CHARSET_UTF8    = 8
} apreq_charset_t;
==================================================
2) in apreq_param.h: replace the "utf8" stuff with
/** Sets the character encoding for this parameter; returns the old one. */
static APR_INLINE
apreq_charset_t apreq_param_charset_set(apreq_param_t *p, unsigned char c) {
    /* Read the previous value before overwriting it. */
    unsigned char old = APREQ_FLAGS_GET(p->flags, APREQ_CHARSET);
    APREQ_FLAGS_SET(p->flags, APREQ_CHARSET, c);
    return old;
}
/** Gets the character encoding for this parameter. */
static APR_INLINE
apreq_charset_t apreq_param_charset_get(apreq_param_t *p) {
    return APREQ_FLAGS_GET(p->flags, APREQ_CHARSET);
}
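To illustrate how accessors like these can pack a charset value into a
flags word, here is a small self-contained sketch. The DEMO_* macros and
demo_param_t are hypothetical stand-ins, not the real APREQ_FLAGS_* macros
or apreq_param_t, and the 8-bit field at bit 0 is an assumption made only
for this example.

```c
/* Assumed layout: the charset occupies the low 8 bits of the flags word. */
#define DEMO_CHARSET_BIT   0
#define DEMO_CHARSET_MASK  0xFFu

/* Extract the named field from a flags word. */
#define DEMO_FLAGS_GET(f, name) (((f) >> name##_BIT) & name##_MASK)

/* Overwrite the named field, leaving all other bits untouched. */
#define DEMO_FLAGS_SET(f, name, v)                        \
    ((f) = (((f) & ~(name##_MASK << name##_BIT))          \
            | ((name##_MASK & (v)) << name##_BIT)))

/* Hypothetical parameter type carrying only a flags word. */
typedef struct {
    unsigned flags;
} demo_param_t;

/* Stores a new charset code and returns the one it replaced. */
static unsigned demo_charset_set(demo_param_t *p, unsigned c)
{
    unsigned old = DEMO_FLAGS_GET(p->flags, DEMO_CHARSET);
    DEMO_FLAGS_SET(p->flags, DEMO_CHARSET, c);
    return old;
}

/* Reads the charset code currently stored in the flags word. */
static unsigned demo_charset_get(const demo_param_t *p)
{
    return DEMO_FLAGS_GET(p->flags, DEMO_CHARSET);
}
```

Returning the old value from the setter lets a caller restore it later
without a separate get call.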
==================================================
3) upgrade apreq_param_decode() and apreq_param_decodev()
to report the charset detected (probably via the return value
so people can't just ignore it). The divination logic would
go like this:
a) Presume the charset is 7-bit ASCII;
   if that cannot possibly be true, then
b) Presume the data was utf8 encoded. If
   that cannot possibly be true, then
c) Presume the data was encoded using iso-8859-1,
   unless control characters (0x80 - 0x9F) appear;
   in that case,
d) Mark it windows-1252.
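
Steps (a) through (d) could be sketched in C roughly as follows. This is
an illustration under assumptions, not the code in the branch: charset_t,
is_valid_utf8() and divine_charset() are hypothetical names, and the input
is taken to be the already %-decoded bytes.

```c
#include <stddef.h>

/* Hypothetical stand-in for the enum proposed in (1). */
typedef enum {
    CHARSET_ASCII  = 0,
    CHARSET_LATIN1 = 1,
    CHARSET_CP1252 = 2,
    CHARSET_UTF8   = 8
} charset_t;

/* Returns 1 if s[0..len) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t n, k;            /* n = number of continuation bytes */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1Fu; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0Fu; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07u; }
        else return 0;          /* stray continuation or invalid lead byte */

        if (len - i <= n) return 0;             /* truncated sequence */
        for (k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3Fu);
        }
        /* Reject overlong forms, surrogates, and out-of-range values. */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000)) return 0;
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
        i += n + 1;
    }
    return 1;
}

/* Applies steps (a)-(d) to a decoded byte string. */
charset_t divine_charset(const unsigned char *s, size_t len)
{
    int has_8bit = 0, has_c1 = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            has_8bit = 1;
            if (s[i] <= 0x9F) has_c1 = 1;   /* 0x80 - 0x9F range */
        }
    }
    if (!has_8bit) return CHARSET_ASCII;               /* (a) */
    if (is_valid_utf8(s, len)) return CHARSET_UTF8;    /* (b) */
    return has_c1 ? CHARSET_CP1252 : CHARSET_LATIN1;   /* (c), (d) */
}
```

Because step (b) runs before step (c), valid utf8 whose continuation
bytes happen to fall in 0x80 - 0x9F is still reported as utf8.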
==================================================
4) expose a utility function which converts cp-1252 strings
to utf8.
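
A utility of this kind might look something like the sketch below. The
function name cp1252_to_utf8() and its calling convention (the caller
supplies a buffer of at least 3*len + 1 bytes) are assumptions made for
illustration only. The five byte values that windows-1252 leaves undefined
(0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through here as the matching C1
control code points, which is one possible policy among several.

```c
#include <stddef.h>

/* Unicode code points for cp-1252 bytes 0x80 - 0x9F. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/*
 * Converts len bytes of cp-1252 text in src to NUL-terminated UTF-8 in dst.
 * dst must have room for 3*len + 1 bytes (worst case: every input byte
 * becomes a 3-byte sequence). Returns the number of bytes written, not
 * counting the terminating NUL.
 */
size_t cp1252_to_utf8(char *dst, const unsigned char *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned int cp = src[i];

        /* Only 0x80 - 0x9F differ from iso-8859-1. */
        if (cp >= 0x80 && cp <= 0x9F)
            cp = cp1252_high[cp - 0x80];

        if (cp < 0x80) {
            *d++ = (unsigned char)cp;
        } else if (cp < 0x800) {
            *d++ = (unsigned char)(0xC0 | (cp >> 6));
            *d++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *d++ = (unsigned char)(0xE0 | (cp >> 12));
            *d++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *d++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *d = '\0';
    return (size_t)(d - (unsigned char *)dst);
}
```

The same routine also handles plain iso-8859-1 input, since bytes 0xA0 and
above map to the identically numbered code points.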
==================================================
5) Replace the perl-glue's $param->is_utf8() method with charset().
When we have to expose a cp-1252 encoded param to a perl user,
we use the utility function from (4) and translate the data to utf8
(in the SvPV; we don't modify the apreq_param_t at all).
--
Joe Schaefer
Re: unicode
Posted by Joe Schaefer <jo...@sunstarsys.com>.
Max Kellermann <ma...@duempel.org> writes:
> On 2005/03/17 02:56, Stas Bekman <st...@stason.org> wrote:
>> > all %-encodings are 7-bit -> mark as APREQ_CHARSET_ASCII,
>> > some %-encoding is 8-bit -> divine the encoding, mark result
>> > as either iso-8859-1 or utf8.
>> >
>>
>> What do you mean, Joe? To automatically convert any input to a
>> predefined format? Or do you mean something else?
>
> No, just to mark the value as "ASCII" or "UTF-8", and to let the
> application decide how to handle it. No conversion.
Yup, but there's one snafu here: AFAICT the windows-1252 encodings
must be mapped to utf8 (none of the 27 chars mentioned in Sam Ruby's
survival guide translate to iso-8859-1). I'm not sure how we should
handle this, but two options seem obvious: translate that to utf8,
or add windows-1252 to our charset list.
--
Joe Schaefer
Re: unicode (was Re: today's code review..)
Posted by Stas Bekman <st...@stason.org>.
Max Kellermann wrote:
> On 2005/03/17 02:56, Stas Bekman <st...@stason.org> wrote:
>
>>> all %-encodings are 7-bit -> mark as APREQ_CHARSET_ASCII,
>>> some %-encoding is 8-bit -> divine the encoding, mark result
>>> as either iso-8859-1 or utf8.
>>>
>>
>>What do you mean, Joe? To automatically convert any input to a predefined
>>format? Or do you mean something else?
>
>
> No, just to mark the value as "ASCII" or "UTF-8", and to let the
> application decide how to handle it. No conversion.
Thanks Max, I have a long way to go to catch up with all the new changes...
Re: unicode (was Re: today's code review..)
Posted by Max Kellermann <ma...@duempel.org>.
On 2005/03/17 02:56, Stas Bekman <st...@stason.org> wrote:
> > all %-encodings are 7-bit -> mark as APREQ_CHARSET_ASCII,
> > some %-encoding is 8-bit -> divine the encoding, mark result
> > as either iso-8859-1 or utf8.
> >
>
> What do you mean, Joe? To automatically convert any input to a predefined
> format? Or do you mean something else?
No, just to mark the value as "ASCII" or "UTF-8", and to let the
application decide how to handle it. No conversion.
Max