
Posted to apreq-dev@httpd.apache.org by Markus Wichitill <ma...@gmx.de> on 2005/04/17 19:42:56 UTC

New charset support breaks existing app charset/utf8 support

Hi,

after updating to yesterday's unstable branch, the existing UTF-8 support in 
my application doesn't work anymore, probably because apreq2 now does its 
own decoding, and decoding twice fails.

With no documentation yet, I don't know what exactly apreq does, but is 
there a way to switch it all off? If there isn't, there probably should be, 
since my app won't be the only one that's affected by this.

Also, apreq is probably wasting performance by doing more charset guessing 
(?) than necessary for my app, which knows the charset/encoding, and 
therefore only needs to call utf8::decode with no guessing when running in 
UTF-8 mode, or simply does nothing if not running in UTF-8 mode.
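
A minimal sketch of the app-side approach described above, assuming a hypothetical $UTF8_MODE setting and a hypothetical decode_param() helper (neither is part of apreq). It also illustrates why decoding twice fails: utf8::decode returns false when it is handed a string that has already been decoded.

```perl
use strict;
use warnings;

# Hypothetical application setting; not an apreq option.
my $UTF8_MODE = 1;

# Decode a raw parameter value exactly once when running in UTF-8 mode.
sub decode_param {
    my ($value) = @_;
    return $value unless $UTF8_MODE && defined $value;
    return $value if utf8::is_utf8($value);   # already decoded by someone else
    utf8::decode($value);                     # in place, no charset guessing
    return $value;
}

my $raw  = "gr\xC3\xBC\xC3\x9F";   # UTF-8 bytes for "gr", U+00FC, U+00DF
my $once = decode_param($raw);     # 4 characters, UTF8 flag set

# Decoding the already-decoded string a second time is what breaks:
my $again        = $once;
my $second_is_ok = utf8::decode($again);   # false for this input
```

The utf8::is_utf8 guard is one way to make the wrapper tolerate a parser that has already set the flag; whether that is preferable to switching the parser's behavior off is the question raised in this message.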

Re: New charset support breaks existing app charset/utf8 support

Posted by Markus Wichitill <ma...@gmx.de>.

Joe Schaefer wrote:
> Is the current trunk any better for you?  Now the default behavior 
> should not set the SV's UTF8 flag.

I can't say I understand the change in rev 161902 and the connection with 
tainting, but it seems to work, there's no flag.

Re: New charset support breaks existing app charset/utf8 support

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Markus Wichitill <ma...@gmx.de> writes:

> Joe Schaefer wrote:
>> +1 to being enctype-agnostic.  Do you know how CGI.pm behaves here?
>> I think our default behavior should mimic that as much as possible.
>
> CGI.pm doesn't set any utf8 flags. Neither do CGI::Simple nor CGI::Minimal.

Thanks, I see that.  I'm still struggling with the issues though.
Is the current trunk any better for you?  Now the default behavior 
should not set the SV's UTF8 flag.

-- 
Joe Schaefer

Re: New charset support breaks existing app charset/utf8 support

Posted by Markus Wichitill <ma...@gmx.de>.

Joe Schaefer wrote:
> +1 to being enctype-agnostic.  Do you know how CGI.pm behaves here?
> I think our default behavior should mimic that as much as possible.

CGI.pm doesn't set any utf8 flags. Neither do CGI::Simple nor CGI::Minimal.

Re: New charset support breaks existing app charset/utf8 support

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Markus Wichitill <ma...@gmx.de> writes:

> Joe Schaefer wrote:

[...]

>> And of course it might hurt your own code, because that's another
>> portability condition you may need to work around.
>
> Are you still talking about the patch-level issue above,

Yes, the patch-level issue.

[...]

> So you only set the utf8 flags for parameters in GETs and POSTs with
> application/x-www-form-urlencoded, but not in POSTs with
> multipart/form-data. I don't even know how to efficiently work around
> that anymore. My param wrapper function has no idea what enctype the
> form was in, nor should it. 

+1 to being enctype-agnostic.  Do you know how CGI.pm behaves here?
I think our default behavior should mimic that as much as possible.

>
>> but whoever's doing apreq in the future can
>> optimize that away by just assuming that data is utf8.
>
> Can't do that with untrusted data from the Internets, set a utf8
> flag on invalid data, and Perl will happily crash.

Agreed.  So if we don't always validate, I think we need to 
decouple apreq's internal charset support from perl's.  Since
we want to encourage users to write parsers, I think
the best idea at this point is to just add another flag for
perl's UTF-8 flag.  If perl users want to set that flag 
themselves by marching through the param table, that'd be 
ok with me.
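
A rough sketch of that decoupling, assuming a hypothetical in-memory param table (an array of hashes, not apreq's real table type) and a hypothetical upgrade_marked_utf8() helper. The parser only records a charset mark (0 = ascii, 8 = utf8, the values quoted elsewhere in this thread), and Perl's UTF8 flag is set in a separate, opt-in pass.

```perl
use strict;
use warnings;

use constant { CHARSET_ASCII => 0, CHARSET_UTF8 => 8 };

# Hypothetical stand-in for a parsed param table: each value has been
# validated and marked by the parser, but the bytes are left untouched.
my @table = (
    { name => 'city', value => "Z\xC3\xBCrich", charset => CHARSET_UTF8  },
    { name => 'id',   value => "42",            charset => CHARSET_ASCII },
);

# Opt-in pass: march through the table and turn values that were marked as
# valid UTF-8 into flagged character strings. Everything else stays as bytes.
sub upgrade_marked_utf8 {
    my ($table) = @_;
    for my $param (@$table) {
        next unless $param->{charset} == CHARSET_UTF8;
        utf8::decode($param->{value});
    }
    return $table;
}

my $before = utf8::is_utf8($table[0]{value}) ? 1 : 0;   # flag not set by default
upgrade_marked_utf8(\@table);
```

Applications that treat UTF-8 as 8-bit data would simply never call the second pass.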

-- 
Joe Schaefer

Re: New charset support breaks existing app charset/utf8 support

Posted by Markus Wichitill <ma...@gmx.de>.

Joe Schaefer wrote:
> Would it make sense to not set the SV's UTF-8 flag for perl < 5.8.0?

Absolutely.

> I don't think changing the behavior within a patch-level upgrade is
> wise, so if we enable it for 5.8.3 we should enable it for 5.8.0 also.

If you insist on forcing the utf8 flags, then yes.

> And of course it might hurt your own code, because 
> that's another portability condition you may need to work around.

Are you still talking about the patch-level issue above, or my 
$apr->charset_support(0) config method proposal in general? If it's the 
latter, I don't see the logic.

> But the slowdown should only impact non-ascii data.  If there's a lot
> of that around, you probably shouldn't be url-encoding it in the first
> place. 

So you only set the utf8 flags for parameters in GETs and POSTs with
application/x-www-form-urlencoded, but not in POSTs with 
multipart/form-data. I don't even know how to efficiently work around that 
anymore. My param wrapper function has no idea what enctype the form was in, 
nor should it.

> but whoever's doing apreq in the future can
> optimize that away by just assuming that data is utf8.

Can't do that with untrusted data from the Internets: set a utf8 flag on 
invalid data, and Perl will happily crash.
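
To illustrate the point about untrusted input, here is a small sketch that validates before any flag is set. The decode_untrusted() helper is hypothetical; it relies on the core Encode module with FB_CROAK, which dies on malformed input instead of flagging it.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a character string for well-formed UTF-8,
# or undef for anything else, so the flag never lands on invalid data.
sub decode_untrusted {
    my ($bytes) = @_;
    my $copy  = $bytes;   # Encode may modify its input when a CHECK is given
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $chars;        # undef if decode() died
}

my $good = decode_untrusted("caf\xC3\xA9");   # well-formed
my $bad  = decode_untrusted("caf\xC3\x28");   # 0xC3 lacks a continuation byte
```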

Re: New charset support breaks existing app charset/utf8 support

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Markus Wichitill <ma...@gmx.de> writes:

[...]

> Some more points:
>
> - Perl 5.6 may have the beginnings of UTF-8 support built-in, but it's
> buggy and there's no interface to use that functionality, so for all
> intents and purposes it doesn't support UTF-8. I'm not sure if that
> actually happens now, but you really don't want to set the flag under
> that version. 
>
> - Even Perl 5.8.0 - 5.8.2 are too buggy to be safely used with utf8-flagged
> scalars.

Would it make sense to not set the SV's UTF-8 flag for perl < 5.8.0?
I don't think changing the behavior within a patch-level upgrade is
wise, so if we enable it for 5.8.3 we should enable it for 5.8.0 also.
This would be a compile-time check, so it wouldn't hurt performance.
Making it configurable at run-time will either slow things down or add 
more bloated code.  And of course it might hurt your own code, because 
that's another portability condition you may need to work around.

> - Analyzing the parameters for encoding, when that's not required in
> many cases, seems wasteful from a performance perspective. And
> performance was always an important aspect of apreq.

But the slowdown should only impact non-ascii data.  If there's a lot
of that around, you probably shouldn't be url-encoding it in the first
place.  Besides, performance is something we can retune as the situation 
improves.  At the moment, I think it's better to spend some time validating 
non-ascii form data, but whoever's doing apreq in the future can
optimize that away by just assuming that data is utf8.

-- 
Joe Schaefer

Re: New charset support breaks existing app charset/utf8 support

Posted by Markus Wichitill <ma...@gmx.de>.

Joe Schaefer wrote:
> I'm hoping that our stuff just 
> plain-old works without any monkeying around with decoders.

In case you or other readers haven't had to deal much with utf8 flags yet, 
here are a few examples of what happens when a module sets utf8 flags and 
the application isn't prepared to handle that, because it's used to treating 
UTF-8 like an 8-bit encoding (which is a lot simpler, although not perfect):

- Anytime you print a utf8 string, or write it to a file, Perl will convert 
it to latin1 if possible, or else warn about "Wide character in print at ..." 
and write UTF-8 anyway, so the output can end up as garbage. Handling this 
requires setting :utf8 layers on handles, which is complicated by having to 
deal with tied IO and PerlIO, as in the case of mod_perl.

- If you pass a utf8 string to an XS module that doesn't handle utf8, 
there's a good chance it will do the same as above or die. Handling this 
requires plenty of ugly utf8::decode/utf8::encode pairs.

- Anytime the utf8-flagged strings are combined with other strings that are 
already in UTF-8 format, but don't have the flag, the unflagged strings are 
wrongly converted from latin1 to UTF-8 by Perl, destroying them. Handling 
this requires taking care of decoding all possible data sources early.

All in all, handling utf8-flagged strings in Perl isn't all that easy, and 
it's not made any simpler by the scattered and partly confusing Perl docs.
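
The third point above can be reproduced with core Perl alone. This is only a sketch with made-up variable names: it concatenates a flagged string with unflagged UTF-8 bytes and shows the bytes that would end up on the wire.

```perl
use strict;
use warnings;

my $bytes = "\xC3\xA9";     # U+00E9 as two UTF-8 bytes, no flag (8-bit style)
my $chars = $bytes;
utf8::decode($chars);       # U+00E9 as one character, flag set

# Perl upgrades the unflagged operand as if it were latin1, so the two
# bytes become the two characters U+00C3 and U+00A9.
my $mixed = $chars . $bytes;

# Encode back to bytes, as a :utf8 output layer would.
my $wire = $mixed;
utf8::encode($wire);        # C3 A9 followed by the double-encoded C3 83 C2 A9

# Decoding every source early avoids the mix-up.
my $fixed = $bytes;
utf8::decode($fixed);
my $clean = $chars . $fixed;   # two characters, both U+00E9
```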

Some more points:

- Perl 5.6 may have the beginnings of UTF-8 support built-in, but it's buggy 
and there's no interface to use that functionality, so for all intents and 
purposes it doesn't support UTF-8. I'm not sure whether apreq actually sets 
the flag there now, but you really don't want to set it under that version.

- Even Perl 5.8.0 - 5.8.2 are too buggy to be safely used with utf8-flagged 
scalars.

- Few XS modules support utf8, and this will probably never change, what 
with many modules, including important ones, being barely or not at all 
maintained. In this environment, I see no point in forcing utf8 flags on users.

- Analyzing the parameters for encoding, when that's not required in many 
cases, seems wasteful from a performance perspective. And performance was 
always an important aspect of apreq.

Re: New charset support breaks existing app charset/utf8 support

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Markus Wichitill <ma...@gmx.de> writes:

[...]

> I was hoping for a $apr->charset_support(1) config method that only
> switches on charset processing when requested.

That's a possibility, but actually I'm hoping that our stuff just 
plain-old works without any monkeying around with decoders.

> Perl 5.8.1+ and most CPAN modules that have added UTF-8 support don't
> set any utf8 flags unless they've been asked to, for good reason.  For
> example, all the trouble that Perl 5.8.0 had caused when it used utf8
> by default on machines with UTF-8 locales proved why utf8 flags
> shouldn't be forced on users. Things will just fail on so many
> fronts. 

Err, I want to see a bug report about what we've got now.  The newer 
specs are very clear that utf8 is the preferred encoding of anything that 
ain't ascii.  That's our model as well.  We only mark a string as 
utf8 if it's not ascii, and it's a valid utf8 bytestream.  And if it turns
out that neither is true, we also mark it as such.  I won't do the wrong
thing just because people are used to doing the wrong thing.
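
The marking rule described above could be sketched in Perl roughly as follows. This is not apreq's implementation: classify_charset() and its string labels are hypothetical, and utf8::decode on a copy stands in for the C-level validator, which may well be stricter.

```perl
use strict;
use warnings;

# Hypothetical classifier: 'ascii' for 7-bit data, 'utf8' for non-ascii data
# that is a valid UTF-8 byte stream, 'other' when neither is true.
sub classify_charset {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;
    my $copy = $bytes;                      # validate a copy, keep the input
    return utf8::decode($copy) ? 'utf8' : 'other';
}

my @marks = map { classify_charset($_) } ("abc", "\xC3\xA9", "\xE9");
```

Note that the input bytes are never modified here, which matches the statement later in the thread that apreq validates but does not decode.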

-- 
Joe Schaefer

Re: New charset support breaks existing app charset/utf8 support

Posted by Markus Wichitill <ma...@gmx.de>.

Joe Schaefer wrote:
> Use $param->charset() to see what apreq thinks the charset is.
> You need something like (untested)
> 
>     $body = $req->body;
>     $body->param_class("APR::Request::Param");
>     my $funky_param = $body->{foo};
>     print "charset = ", $funky_param->charset; # 0 = ascii, 8 = utf8
> 
> apreq just marks the charset.  IIRC (I have to run now, sorry if this is
> wrong) you should be able to turn off the utf8 stuff via:
> 
>     $funky_param->charset(0);

Having to deal with Param objects when I only want to fetch a simple string 
via $apr->param() seems overly complicated to me. I was hoping for a 
$apr->charset_support(1) config method that only switches on charset 
processing when requested.

Perl 5.8.1+ and most CPAN modules that have added UTF-8 support don't set 
any utf8 flags unless they've been asked to, for good reason. For example, 
all the trouble that Perl 5.8.0 had caused when it used utf8 by default on 
machines with UTF-8 locales proved why utf8 flags shouldn't be forced on 
users. Things will just fail on so many fronts.

> without using utf8::decode and see if apreq isn't guessing the charset
> right anyways.

It does, but I'd still rather use utf8::decode myself, because then I don't 
have to do different error handling etc. for apreq2, apreq1 and the pureperl 
parser for non-mod_perl platforms. And I don't even want to know what 
happens if apreq2 sets utf8 flags when running under Perl 5.6.

Re: New charset support breaks existing app charset/utf8 support

Posted by Joe Schaefer <jo...@sunstarsys.com>.

Markus Wichitill <ma...@gmx.de> writes:

> after updating to yesterday's unstable branch, the existing UTF-8
> support in my application doesn't work anymore, which is probably
> since apreq2 does its own decoding now, and decoding twice fails.

Use $param->charset() to see what apreq thinks the charset is.
You need something like (untested)

    $body = $req->body;
    $body->param_class("APR::Request::Param");
    my $funky_param = $body->{foo};
    print "charset = ", $funky_param->charset; # 0 = ascii, 8 = utf8

> With no documentation yet, I don't know what exactly apreq does, but
> is there a way to switch it all off? 

apreq just marks the charset.  IIRC (I have to run now, sorry if this is
wrong) you should be able to turn off the utf8 stuff via:

    $funky_param->charset(0);

[...]

> Also, apreq is probably wasting performance by doing more charset
> guessing (?) than necessary for my app, which knows the
> charset/encoding, and therefore only  needs to call utf8::decode with
> no guessing when running in UTF-8 mode, or simply does nothing if not
> running in UTF-8 mode. 

It only validates utf8, it does not "decode" anything.  Try running
without using utf8::decode and see if apreq isn't guessing the charset
right anyways.

-- 
Joe Schaefer