Posted to modperl@perl.apache.org by Steve Hay <st...@uk.radan.com> on 2003/07/11 15:06:51 UTC
Undocumented behaviour in Apache->print()?
Hi,
I've just spent quite a while tracking down a problem with a web page
generated by a mod_perl program in which 8-bit ISO-8859-1 characters
were not being shown properly. The software runs via Apache::Registry,
and works fine under mod_cgi.
It turns out that the problem is due to a difference in behaviour
between Perl's built-in print() function in Perl 5.8.0+ and the
Apache->print() method that mod_perl overrides it with. I've consulted
the documentation on the mod_perl website, and could find no mention of
the difference. If my conclusions below are correct then this
information may well be worth adding.
Under Perl 5.8.0, if a string stored in Perl's internal UTF-8 format is
passed to print() then by default it will be converted to the machine's
native 8-bit character set on output to STDOUT. In my case, this is
exactly as if I had called binmode(STDOUT, ':encoding(iso-8859-1)')
before the print(). (If any characters in the UTF-8 string are not
representable in ISO-8859-1 then a "Wide character in print" warning
will be emitted, and the whole string will be output in its UTF-8
encoded form.)
However, mod_perl's Apache->print() method does not perform this
automatic conversion. It simply prints the bytes that make up each
UTF-8 character (i.e. it outputs the UTF-8 string as UTF-8), exactly as
if you had called binmode(STDOUT, ':utf8') before Apache->print(). (No
"Wide character in print()" warnings are produced for characters with
code points > 0xFF either.)
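The two behaviours can be compared without a web server. The sketch below assumes that printing to an in-memory filehandle goes through the same conversion as printing to STDOUT; the helper name bytes_written is made up for illustration. With no layer, the flagged string should arrive as the single byte FC, and with the :utf8 layer as the two bytes C3 BC.

```perl
use strict;
use warnings;
use Encode ();

# A one-character string (U+00FC) held in Perl's internal UTF-8 format.
my $str = Encode::decode('iso-8859-1', "\xFC");

# Hypothetical helper: print $text to an in-memory filehandle, optionally
# with an I/O layer, and return the bytes that actually arrived.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "Cannot open in-memory handle: $!";
    binmode($fh, $layer) if defined $layer;
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return $buf;
}

my $plain = bytes_written($str);           # no layer: downgraded on output
my $utf8  = bytes_written($str, ':utf8');  # :utf8 layer: written as UTF-8

printf "no layer: %s\n", join ' ', map { sprintf '%02X', ord } split //, $plain;
printf ":utf8   : %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

The second result is what Apache->print() sends in the case described here, whether or not a layer has been requested.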
The test program below illustrates this difference:
use 5.008;
use strict;
use warnings;
use Encode;
my $cset = 'ISO-8859-1';
#my $cset = 'UTF-8';
print "Content-type: text/html; charset=$cset\n\n";
print <<EOT;
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=$cset">
</head>
<body>
EOT
# $str is stored in Perl's internal UTF-8 format.
my $str = Encode::decode('iso-8859-1', 'Zurück');
print "<p>$str</p>\n";
print <<EOT;
</body>
</html>
EOT
Running under mod_cgi (using Perl's built-in print() function) the UTF-8
encoded data in $str is converted to ISO-8859-1 on-the-fly by the
print(), and the end-user will see the intended output when $cset is
ISO-8859-1. (Changing $cset to UTF-8 causes the ü to be replaced with ?
in my web browser, because the single byte 0xFC that is output is not a
valid UTF-8 sequence, although the output is labelled as UTF-8.)
Running under mod_perl (with Perl's built-in print() function now
overridden by the Apache->print() method) the UTF-8 encoded data in $str
is NOT converted to ISO-8859-1 on-the-fly as it is printed, and the
end-user will see the two bytes that make up the UTF-8 representation of
ü when $cset is ISO-8859-1. Changing $cset to UTF-8 in this case
"fixes" it, because the output stream in this case happens to be valid
UTF-8 all the way through.
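What the browser makes of each combination can be approximated with Encode, treating decode() as a stand-in for the browser's charset handling (an assumption; real browsers may substitute differently). The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

my $latin1_bytes = "\xFC";      # what mod_cgi sent for U+00FC
my $utf8_bytes   = "\xC3\xBC";  # what mod_perl sent for U+00FC

# Bytes labelled with the matching charset decode to the intended character.
my $ok_latin1 = Encode::decode('iso-8859-1', $latin1_bytes);
my $ok_utf8   = Encode::decode('UTF-8',      $utf8_bytes);

# Mismatched labels: a lone 0xFC is malformed UTF-8, so Encode substitutes
# a replacement character for it ...
my $bad_utf8   = Encode::decode('UTF-8', $latin1_bytes);
# ... and the two UTF-8 bytes read as ISO-8859-1 become two characters.
my $bad_latin1 = Encode::decode('iso-8859-1', $utf8_bytes);

printf "U+%04X\n", ord $_ for $ok_latin1, $ok_utf8, $bad_utf8;
printf "%d characters\n", length $bad_latin1;
```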
There are two solutions to this:
1. To use $cset = 'ISO-8859-1': Explicitly convert the UTF-8 data in
$str to ISO-8859-1 yourself before sending it to print(), rather than
relying on print() to do that for you. This is not possible in general
(not all characters in the UTF-8 string may be representable in
ISO-8859-1), but for HTML output we can arrange for Encode::encode to
convert any non-representable characters to their HTML character references:
$str = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);
2. To use $cset = 'UTF-8': Output UTF-8 directly, ensuring that *all*
outgoing data is UTF-8 by adding an appropriate layer on STDOUT:
binmode STDOUT, ':utf8';
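As a rough illustration of what each solution puts on the wire, here is a sketch using a string that contains a euro sign, which ISO-8859-1 cannot represent. Encode::encode('UTF-8', ...) stands in for the effect of the :utf8 layer, and LEAVE_SRC is added only so that the same $str can be encoded twice; both are choices made for this example.

```perl
use strict;
use warnings;
use Encode ();

# "Zurück" plus a euro sign (U+20AC), which is not in ISO-8859-1.
my $str = "Zur\x{FC}ck \x{20AC}";

# Solution 1: encode to ISO-8859-1, falling back to HTML character
# references. LEAVE_SRC keeps $str intact for the second call below.
my $latin1 = Encode::encode('iso-8859-1', $str,
                            Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

# Solution 2: encode the whole string as UTF-8 (every character fits).
my $utf8 = Encode::encode('UTF-8', $str);

printf "ISO-8859-1: %d bytes\n", length $latin1;
printf "UTF-8     : %d bytes\n", length $utf8;
```

In the first result the ü is the single byte 0xFC and the euro sign becomes the reference &#8364;, so the page still has to be labelled ISO-8859-1.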
The second method here is generally to be preferred, but in the old
software that I was experiencing problems with, I was not able to add
the utf8 layer to STDOUT reliably (the data was being output from a
multitude of print() statements scattered in various places), so I stuck
with the first method. I believed that it should work without the
explicit encoding to ISO-8859-1 because I was unaware that mod_perl's
print() override removed Perl's implicit encoding behaviour. Actually,
the explicit encoding above is better anyway because it also handles
characters that can't be encoded to ISO-8859-1, but nevertheless I think
the difference in mod_perl's print() is still worth mentioning in the
documentation somewhere.
Cheers,
Steve
Re: Undocumented behaviour in Apache->print()?
Posted by Stas Bekman <st...@stason.org>.
Steve Hay wrote:
>> 5.8.0 is a pretty new perl version, which provides the new
>> functionality, and it seems that hardly anybody has been using the UTF
>> stuff with mod_perl.
>
>
> 5.8.0 is actually a couple of days short of being one year old (happy
> birthday!), which is increasingly not that new any more. 5.8.1 should
> be out soon too.
I meant that it was too new to be embraced by the crowd. It'll probably take a
few more years before that happens. In any case, this is just an excuse ;)
> As for hardly anybody using UTF8 stuff with mod_perl... I didn't think
> that I was until I realised that most XML parsers (certainly the two
> that I use most -- XML::LibXML and XML::DOM) return all their data in
> Perl's internal UTF-8 format! Then the penny dropped that I was
> actually using it rather a lot :-)
I thought XML was dead. Do people still use this archaic technology? I went to
this session at this OS conference with many k00l ppls and there was this
dude[1] who said that YAML is the future. Next they started talking about
animals, and for some reason everybody liked ponie. All well, orange people
[2], orange sites [3], orange ponies [4], jetlag, too many flights, too little
sleep...
1:
http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/ingy_nino_tired.jpg
2:
http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/acme_perl_hacker_scary.jpg
3: http://search.cpan.org/
4: http://ponie.kwiki.org/ http://www.poniecode.org/
;)
__________________________________________________________________
Stas Bekman JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/ mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org http://ticketmaster.com
Re: Undocumented behaviour in Apache->print()?
Posted by Steve Hay <st...@uk.radan.com>.
Steve Hay wrote:
> Stas Bekman wrote:
>
>
>> > I have attempted to shoe-horn this into mod_perl's print() method (in
>> > "src/modules/perl/Apache.xs"). Here's the diff against mod_perl 1.28:
>> > [Unfortunately, I've had to comment-out the first part of that "if"
>> > block, because I got an unresolved external symbol error relating
>> to the
>> > PerlIO_isutf8() function otherwise (which may be because that function
>> > isn't documented in the perlapio manpage).]
>>
>> mod_perl 1.x doesn't use perlio, hence you have this problem. adding:
>>
>> #include "perlio.h"
>>
>> should resolve it I think.
>
>
> No. The error was "unresolved external symbol", which means that the
> compiler is happy (it evidently has pulled in perlio.h, or something
> else that declares PerlIO_isutf8() as "extern ..."), but that the
> linker couldn't find the definition of that function.
>
> (Check: If I change "PerlIO_isutf8" to "PerlIO_isutf" (deliberate
> typo) then I get a different error - "undefined; assuming extern
> returning int" - because now no declaration has been supplied.)
>
> Listing the symbols exported from perl58.lib shows that PerlIO_isutf8
> is *not* one of them. So where's the definition supposed to come from?
>
> I'll ask about this on the perlxs mailing list, I think.
Having asked about this, it turns out that the problem was
PerlIO_isutf8() not being exported from perl58.lib on Windows (and other
platforms where the symbols to export have to be explicitly listed).
I sent a patch off to p5p which fixes this
(http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-07/msg01096.html),
and it has been integrated as #20203.
So presumably this will not be a problem in the upcoming perl-5.8.1, but
how do we cope with the perl-5.8.0 case?
I've attached a new patch (against mod_perl-1.28) which (I believe)
fixes the UTF-8 problem, but it won't build on Windows with perl-5.8.0
without #20203.
Steve
Re: Undocumented behaviour in Apache->print()?
Posted by Steve Hay <st...@uk.radan.com>.
Stas Bekman wrote:
> Steve Hay wrote:
>
>>
>> It's only Perl 5.8 that has the special "UTF-8 flag" which the
>> functions above all operate with respect to. If a Perl variable
>> contains a sequence of bytes that make up a valid UTF-8 character,
>> but the string is not flagged with Perl's special flag, then Perl's
>> built-in print() doesn't do this automatic conversion anyway.
>
>
> Yes.
>
> Apps wanting to handle utf will need to 'require 5.008;' as in your
> example.
>
>> IOW,
>>
>> print "Content-type: text/plain\n\n";
>> $a = "\xC3\xBC";
>> print $a;
>>
>> retrieved from a mod_cgi server produces (via od -b / od -c):
>>
>> 0000000 303 274
>> 0000002
>
>
> yup, because you need to add utf8::decode($a); before printing $a.
> Which your version does as well.
(Indeed. I meant it as an example of how Perl's (5.8's) print() doesn't do
the conversion on strings that are not *flagged* as UTF-8, even when
they contain valid UTF-8.)
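The difference between "bytes that happen to be valid UTF-8" and "a string flagged as UTF-8" can be inspected directly with the utf8:: functions. This is only a small sketch; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $bytes = "\xC3\xBC";   # two bytes that form valid UTF-8, but not flagged
my $chars = $bytes;
utf8::decode($chars);     # now one character, U+00FC, with the flag set

printf "bytes: flagged=%d length=%d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), length $bytes;
printf "chars: flagged=%d length=%d\n",
    (utf8::is_utf8($chars) ? 1 : 0), length $chars;
```

Only the second string is a candidate for the automatic conversion in Perl 5.8's print().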
>
>
>> Perl 5.6 and older don't have the UTF-8 flag and hence don't do any
>> automatic conversion via print(). Therefore, mod_perl's print()
>> should not have the difference from Perl's print() that exists in
>> 5.8, so no change should be required.
>>
>> Sure enough, looking at the "doio.c" source file in Perl 5.6.1, the
>> entire chunk of code that I half-inched above is not present.
>
>
> So you suggest that we copy this functionality from Perl. So we need
> to #ifdef it for 5.8.0.
So I'll add
#if PERL_VERSION >= 8
...
#endif
around the code that I've added.
>
>
> > I have attempted to shoe-horn this into mod_perl's print() method (in
> > "src/modules/perl/Apache.xs"). Here's the diff against mod_perl 1.28:
> > [Unfortunately, I've had to comment-out the first part of that "if"
> > block, because I got an unresolved external symbol error relating to
> the
> > PerlIO_isutf8() function otherwise (which may be because that function
> > isn't documented in the perlapio manpage).]
> >
> > --- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
> > +++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
> > @@ -1119,12 +1119,25 @@
> > SV *sv = sv_newmortal();
> > SV *rp = ST(0);
> > SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> > + /*PerlIO *fp = PerlIO_stdout();*/
> >
> > if(items > 2)
> > do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '',
> @_[1..$#_] */
> > else
> > sv_setsv(sv, ST(1));
> >
> > + /*if (PerlIO_isutf8(fp)) {
> > + if (!SvUTF8(sv))
> > + sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> > + }
> > + else*/ if (DO_UTF8(sv)) {
> > + if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> > + && ckWARN_d(WARN_UTF8))
> > + {
> > + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in
> print");
> > + }
> > + }
> > +
> > PUSHMARK(sp);
> > XPUSHs(rp);
> > XPUSHs(sv);
> >
> > Besides the problem with PerlIO_isutf8(),
>
> mod_perl 1.x doesn't use perlio, hence you have this problem. adding:
>
> #include "perlio.h"
>
> should resolve it I think.
No. The error was "unresolved external symbol", which means that the
compiler is happy (it evidently has pulled in perlio.h, or something
else that declares PerlIO_isutf8() as "extern ..."), but that the linker
couldn't find the definition of that function.
(Check: If I change "PerlIO_isutf8" to "PerlIO_isutf" (deliberate typo)
then I get a different error - "undefined; assuming extern returning
int" - because now no declaration has been supplied.)
Listing the symbols exported from perl58.lib shows that PerlIO_isutf8 is
*not* one of them. So where's the definition supposed to come from?
I'll ask about this on the perlxs mailing list, I think.
>
>
> > there are other problems that
> > spring to my mind straight away with this:
> > - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>
> PerlIO *fp = IoOFP(GvIOp(defoutgv))
Seems to work OK for me. What's defoutgv?
>
>
> > - if "items > 2", do we need to handle the UTF8-ness of each of those
> > items individually before we join them?
>
> I'm not sure. How does perl handle this?
Struggling as best as I can to read pp_print() in Perl's "pp_hot.c", it
looks like Perl calls do_print() (which contains the UTF-8 handling that
I've stolen) for each item in the list that is passed to it.
Considering this more, I think that it probably isn't an issue: if you
have two variables in Perl, one of which is flagged UTF-8 and the other
of which isn't, then when you concatenate them, the whole is "upgraded"
to flagged UTF-8 anyway.
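A quick check of that claim at the Perl level (a sketch only; it says nothing about what do_join() does internally in the XS code):

```perl
use strict;
use warnings;

my $unflagged = "\xFC";       # byte string, no UTF-8 flag
my $flagged   = "\x{20AC}";   # euro sign, necessarily flagged

# Concatenation upgrades the unflagged part, treating its byte as the
# character with the same code point (U+00FC).
my $joined = $unflagged . $flagged;

printf "flagged=%d length=%d first=U+%04X\n",
    (utf8::is_utf8($joined) ? 1 : 0), length $joined, ord $joined;
```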
However, it has occurred to me that I've missed out adding the UTF-8
handling to half of mod_perl's print() method!
The method is split into two halves:
if (!mod_perl_sent_header(r, 0)) {
...
} else {
...
}
and I've only handled the first half!
The first half joins all of the items together and then calls
send_cgi_header(). That outputs everything down to the first blank line
(i.e. all the headers), then sets the "sent headers" flag and recurses
on $r->print(). Next time around, we'll enter the second half, which
simply calls write_client().
If we've already been through the first half then the UTF-8 conversion
will have been applied already, but if we come straight into the second
half (i.e. by printing the headers and the body separately) then the
UTF-8 conversion will not yet have been applied. So as my patch stands,
use utf8;
$a = "\xC3\xBC";
utf8::decode($a);
print "Content-type: text/plain\n\n", $a;
will have the UTF-8 data in $a handled, but
use utf8;
$a = "\xC3\xBC";
utf8::decode($a);
print "Content-type: text/plain\n\n";
print $a;
will not!
The write_client() method appears to call rwrite() (Apache's
ap_rwrite()?) for each item in the list that is passed to it, so I
suppose I should also add the UTF-8 handling code to each of those items
too. (This means that if the headers and body *are* printed together
then the body will be UTF-8-handled twice -- once in the first half of
print(), and then again in write_client(). However, that's "safe": the
handling just ensures that the data is in the appropriate format. It
knows not to do anything if it is already in the correct format.)
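The "safe to do twice" property can be seen from Perl with utf8::downgrade(), which I am using here as a stand-in for the sv_utf8_downgrade() call in the patch:

```perl
use strict;
use warnings;

my $str = "\xC3\xBC";
utf8::decode($str);                 # flagged, one character U+00FC

my $once = $str;
utf8::downgrade($once, 1);          # first pass: becomes the byte 0xFC

my $twice = $once;
my $ok = utf8::downgrade($twice, 1);  # second pass: nothing left to do

printf "same=%d flagged=%d\n",
    ($once eq $twice ? 1 : 0), (utf8::is_utf8($twice) ? 1 : 0);
```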
I've attached a patch that incorporates these changes (with the
PerlIO_isutf8() stuff still commented out until I figure out what to do
about it).
Steve
Re: Undocumented behaviour in Apache->print()?
Posted by Stas Bekman <st...@stason.org>.
Steve Hay wrote:
> Stas Bekman wrote:
>
>>> I have attempted to shoe-horn this into mod_perl's print() method (in
>>> "src/modules/perl/Apache.xs"). Here's the diff against mod_perl
>>> 1.28: [Unfortunately, I've had to comment-out the first part of that
>>> "if" block, because I got an unresolved external symbol error
>>> relating to the PerlIO_isutf8() function otherwise (which may be
>>> because that function isn't documented in the perlapio manpage).]
>>>
>>> --- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
>>> +++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
>>> @@ -1119,12 +1119,25 @@
>>> SV *sv = sv_newmortal();
>>> SV *rp = ST(0);
>>> SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>>> + /*PerlIO *fp = PerlIO_stdout();*/
>>>
>>> if(items > 2)
>>> do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>>> else
>>> sv_setsv(sv, ST(1));
>>>
>>> + /*if (PerlIO_isutf8(fp)) {
>>> + if (!SvUTF8(sv))
>>> + sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>>> + }
>>> + else*/ if (DO_UTF8(sv)) {
>>> + if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>>> + && ckWARN_d(WARN_UTF8))
>>> + {
>>> + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in
>>> print");
>>> + }
>>> + }
>>> +
>>> PUSHMARK(sp);
>>> XPUSHs(rp);
>>> XPUSHs(sv);
>>>
>>> Besides the problem with PerlIO_isutf8(), there are other problems
>>> that spring to my mind straight away with this:
>>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>>> - if "items > 2", do we need to handle the UTF8-ness of each of those
>>> items individually before we join them?
>>> - we need to code this in such a way as to remain backwards
>>> compatible with older Perls.
>>
>>
>>
>> looks like this is the main question. Do we handle utf8 only for perl
>> 5.8?
>
>
> It's only Perl 5.8 that has the special "UTF-8 flag" which the functions
> above all operate with respect to. If a Perl variable contains a
> sequence of bytes that make up a valid UTF-8 character, but the string
> is not flagged with Perl's special flag, then Perl's built-in print()
> doesn't do this automatic conversion anyway.
Yes.
Apps wanting to handle utf will need to 'require 5.008;' as in your example.
> IOW,
>
> print "Content-type: text/plain\n\n";
> $a = "\xC3\xBC";
> print $a;
>
> retrieved from a mod_cgi server produces (via od -b / od -c):
>
> 0000000 303 274
> 0000002
yup, because you need to add utf8::decode($a); before printing $a. Which your
version does as well.
> Perl 5.6 and older don't have the UTF-8 flag and hence don't do any
> automatic conversion via print(). Therefore, mod_perl's print() should
> not have the difference from Perl's print() that exists in 5.8, so no
> change should be required.
>
> Sure enough, looking at the "doio.c" source file in Perl 5.6.1, the
> entire chunk of code that I half-inched above is not present.
So you suggest that we copy this functionality from Perl. So we need to #ifdef
it for 5.8.0.
> I have attempted to shoe-horn this into mod_perl's print() method (in
> "src/modules/perl/Apache.xs"). Here's the diff against mod_perl 1.28:
> [Unfortunately, I've had to comment-out the first part of that "if"
> block, because I got an unresolved external symbol error relating to the
> PerlIO_isutf8() function otherwise (which may be because that function
> isn't documented in the perlapio manpage).]
>
> --- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
> +++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
> @@ -1119,12 +1119,25 @@
> SV *sv = sv_newmortal();
> SV *rp = ST(0);
> SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> + /*PerlIO *fp = PerlIO_stdout();*/
>
> if(items > 2)
> do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
> else
> sv_setsv(sv, ST(1));
>
> + /*if (PerlIO_isutf8(fp)) {
> + if (!SvUTF8(sv))
> + sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> + }
> + else*/ if (DO_UTF8(sv)) {
> + if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> + && ckWARN_d(WARN_UTF8))
> + {
> + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> + }
> + }
> +
> PUSHMARK(sp);
> XPUSHs(rp);
> XPUSHs(sv);
>
> Besides the problem with PerlIO_isutf8(),
mod_perl 1.x doesn't use perlio, hence you have this problem. Adding:
#include "perlio.h"
should resolve it, I think.
> there are other problems that
> spring to my mind straight away with this:
> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
PerlIO *fp = IoOFP(GvIOp(defoutgv))
> - if "items > 2", do we need to handle the UTF8-ness of each of those
> items individually before we join them?
I'm not sure. How does perl handle this?
Re: Undocumented behaviour in Apache->print()?
Posted by Steve Hay <st...@uk.radan.com>.
Stas Bekman wrote:
>> I have attempted to shoe-horn this into mod_perl's print() method (in
>> "src/modules/perl/Apache.xs"). Here's the diff against mod_perl
>> 1.28: [Unfortunately, I've had to comment-out the first part of that
>> "if" block, because I got an unresolved external symbol error
>> relating to the PerlIO_isutf8() function otherwise (which may be
>> because that function isn't documented in the perlapio manpage).]
>>
>> --- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
>> +++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
>> @@ -1119,12 +1119,25 @@
>> SV *sv = sv_newmortal();
>> SV *rp = ST(0);
>> SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>> + /*PerlIO *fp = PerlIO_stdout();*/
>>
>> if(items > 2)
>> do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>> else
>> sv_setsv(sv, ST(1));
>>
>> + /*if (PerlIO_isutf8(fp)) {
>> + if (!SvUTF8(sv))
>> + sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>> + }
>> + else*/ if (DO_UTF8(sv)) {
>> + if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>> + && ckWARN_d(WARN_UTF8))
>> + {
>> + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in
>> print");
>> + }
>> + }
>> +
>> PUSHMARK(sp);
>> XPUSHs(rp);
>> XPUSHs(sv);
>>
>> Besides the problem with PerlIO_isutf8(), there are other problems
>> that spring to my mind straight away with this:
>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>> - if "items > 2", do we need to handle the UTF8-ness of each of those
>> items individually before we join them?
>> - we need to code this in such a way as to remain backwards
>> compatible with older Perls.
>
>
> looks like this is the main question. Do we handle utf8 only for perl
> 5.8?
It's only Perl 5.8 that has the special "UTF-8 flag" which the functions
above all operate with respect to. If a Perl variable contains a
sequence of bytes that make up a valid UTF-8 character, but the string
is not flagged with Perl's special flag, then Perl's built-in print()
doesn't do this automatic conversion anyway.
IOW,
print "Content-type: text/plain\n\n";
$a = "\xC3\xBC";
print $a;
retrieved from a mod_cgi server produces (via od -b / od -c):
0000000 303 274
0000002
Perl 5.6 and older don't have the UTF-8 flag and hence don't do any
automatic conversion via print(). Therefore, mod_perl's print() should
not have the difference from Perl's print() that exists in 5.8, so no
change should be required.
Sure enough, looking at the "doio.c" source file in Perl 5.6.1, the
entire chunk of code that I half-inched above is not present.
Steve
Re: Undocumented behaviour in Apache->print()?
Posted by Stas Bekman <st...@stason.org>.
[putting the test case on the top]
Steve Hay wrote:
>> In any case a simple test that reproduces the problem will be needed.
>
>
> This test program reproduces the problem:
>
> use 5.008;
> use Encode;
> print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
>
> Use LWP's "get" program to get that from an Apache/mod_cgi setup, run it
> through UNIX's "od -c" (get http://localhost/cgi-bin/test.pl | od -c)
> and you get:
>
> 0000000 374
> 0000001
>
> Try the same from an Apache/mod_perl setup and you get:
>
> 0000000 303 274
> 0000002
>
> i.e. the double-byte UTF-8 character representing ü that has been output
> is converted back to ü by Perl's print() [ü is character 252, octal
> 374], but is left as the two bytes by Apache's print().
>
> I've actually re-built my mod_perl using the half-formed patch given
> above and it fixes this particular test case!
On my Linux box it has to be 'od -b'; 'od -c' prints the actual ASCII chars.
I've tested mp2 and it has the same problem. I've used a different version of
your test:
#!/usr/bin/perl -w
use utf8;
print "Content-type: text/plain\n\n";
$a = "\xC3\xBC";
utf8::decode($a); print $a;
which gives the same char, as in:
% perl -le '$a = "\xC3\xBC"; use utf8; utf8::decode($a); print $a;'
ü
mod_perl 1.0 and 2.0 respond with:
GET 'http://localhost:8002/cgi-bin/test.pl' | od -b
0000000 303 274
and mod_cgi with
0000000 374
> Hmm. We really need somebody who understands the internals of Perl and
> mod_perl better than me, but here's a first stab at it:
>
> The Perl source code contains a pp_print() function in "pp_hot.c" which
> I presume is basically CORE::print(). It makes use of a do_print()
> function. I think that function comes from "doio.c", although it's
> actually called Perl_do_print() there. That function does some stuff
> with the UTF-8 flag, which I guess is the sort of thing that we're
> after. Here's a chunk of Perl_do_print() from Perl 5.8.0:
>
> if (PerlIO_isutf8(fp)) {
> if (!SvUTF8(sv))
> sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> }
> else if (DO_UTF8(sv)) {
> if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> && ckWARN_d(WARN_UTF8))
> {
> Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> }
> }
>
> I think what this does is look to see if the fp (a PerlIO *) has the
> ":utf8" layer. If so, then it upgrades the sv to UTF8 (which is always
> possible). If not, then it looks to see if the sv is flagged as UTF8
> and the "bytes" pragma is not in effect (that is what DO_UTF8() tests).
> If so, then it downgrades the sv from UTF8 (which is not
> always possible -- if that fails and the UTF8 warnings category is
> enabled then it outputs the good ol' "Wide character in print" warning).
>
> I have attempted to shoe-horn this into mod_perl's print() method (in
> "src/modules/perl/Apache.xs"). Here's the diff against mod_perl 1.28:
> [Unfortunately, I've had to comment-out the first part of that "if"
> block, because I got an unresolved external symbol error relating to the
> PerlIO_isutf8() function otherwise (which may be because that function
> isn't documented in the perlapio manpage).]
>
> --- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
> +++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
> @@ -1119,12 +1119,25 @@
> SV *sv = sv_newmortal();
> SV *rp = ST(0);
> SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> + /*PerlIO *fp = PerlIO_stdout();*/
>
> if(items > 2)
> do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
> else
> sv_setsv(sv, ST(1));
>
> + /*if (PerlIO_isutf8(fp)) {
> + if (!SvUTF8(sv))
> + sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> + }
> + else*/ if (DO_UTF8(sv)) {
> + if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> + && ckWARN_d(WARN_UTF8))
> + {
> + Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> + }
> + }
> +
> PUSHMARK(sp);
> XPUSHs(rp);
> XPUSHs(sv);
>
> Besides the problem with PerlIO_isutf8(), there are other problems that
> spring to my mind straight away with this:
> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
> - if "items > 2", do we need to handle the UTF8-ness of each of those
> items individually before we join them?
> - we need to code this in such a way as to remain backwards compatible
> with older Perls.
looks like this is the main question. Do we handle utf8 only for perl 5.8?
Re: Undocumented behaviour in Apache->print()?
Posted by Steve Hay <st...@uk.radan.com>.
Hi Stas,
Stas Bekman wrote:
> Steve Hay wrote:
>
>> Hi,
>>
>> I've just spent quite a while tracking down a problem with a web page
>> generated by a mod_perl program in which 8-bit ISO-8859-1 characters
>> were not being shown properly. The software runs via
>> Apache::Registry, and works fine under mod_cgi.
>>
>> It turns out that the problem is due to a difference in behaviour
>> between Perl's built-in print() function in Perl 5.8.0+ and the
>> Apache->print() method that mod_perl overrides it with. I've
>> consulted the documentation on the mod_perl website, and could find
>> no mention of the difference. If my conclusions below are correct
>> then this information may well be worth adding.
>
>
> [the rest of the very detailed analysis has been snipped]
>
> 5.8.0 is a pretty new perl version, which provides the new
> functionality, and it seems that hardly anybody has been using the UTF
> stuff with mod_perl.
5.8.0 is actually a couple of days short of being one year old (happy
birthday!), which is increasingly not that new any more. 5.8.1 should
be out soon too.
As for hardly anybody using UTF8 stuff with mod_perl... I didn't think
that I was until I realised that most XML parsers (certainly the two
that I use most -- XML::LibXML and XML::DOM) return all their data in
Perl's internal UTF-8 format! Then the penny dropped that I was
actually using it rather a lot :-)
> So I suppose you are the first one to hit the problem. Certainly we
> need to update mod_perl to handle this correctly. Would you be
> interested to try to make Apache->print() do the right thing?
Hmm. We really need somebody who understands the internals of Perl and
mod_perl better than me, but here's a first stab at it:
The Perl source code contains a pp_print() function in "pp_hot.c" which
I presume is basically CORE::print(). It makes use of a do_print()
function. I think that function comes from "doio.c", although it's
actually called Perl_do_print() there. That function does some stuff
with the UTF-8 flag, which I guess is the sort of thing that we're
after. Here's a chunk of Perl_do_print() from Perl 5.8.0:
if (PerlIO_isutf8(fp)) {
if (!SvUTF8(sv))
sv_utf8_upgrade(sv = sv_mortalcopy(sv));
}
else if (DO_UTF8(sv)) {
if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
&& ckWARN_d(WARN_UTF8))
{
Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
}
}
I think what this does is look to see if the fp (a PerlIO *) has the
":utf8" layer. If so, then it upgrades the sv to UTF8 (which is always
possible). If not, then it looks to see if the sv is flagged as UTF8
and the "bytes" pragma is not in effect (that is what DO_UTF8() tests).
If so, then it downgrades the sv from UTF8 (which is not
always possible -- if that fails and the UTF8 warnings category is
enabled then it outputs the good ol' "Wide character in print" warning).
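A Perl-level approximation of that decision may make the C easier to follow. The function name bytes_for_output and its boolean second argument are made up for illustration, and the sketch ignores the "bytes" pragma check; it is not a transcription of what doio.c does.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes that would be written for $str,
# given whether the output handle has the :utf8 layer.
sub bytes_for_output {
    my ($str, $handle_is_utf8) = @_;
    my $copy = $str;                  # work on a copy, like sv_mortalcopy()
    if ($handle_is_utf8) {
        utf8::encode($copy);          # characters -> UTF-8 bytes
    }
    elsif (utf8::is_utf8($copy)) {
        # Try to squeeze the string back into single bytes.
        unless (utf8::downgrade($copy, 1)) {
            warn "Wide character in print\n";
            utf8::encode($copy);      # fall back to the UTF-8 bytes
        }
    }
    return $copy;
}

my $u_umlaut = "\xC3\xBC";
utf8::decode($u_umlaut);              # flagged U+00FC

my $for_plain_handle = bytes_for_output($u_umlaut, 0);
my $for_utf8_handle  = bytes_for_output($u_umlaut, 1);

printf "plain handle: %d byte(s)\n", length $for_plain_handle;
printf "utf8 handle : %d byte(s)\n", length $for_utf8_handle;
```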
I have attempted to shoe-horn this into mod_perl's print() method (in
"src/modules/perl/Apache.xs"). Here's the diff against mod_perl 1.28:
[Unfortunately, I've had to comment-out the first part of that "if"
block, because I got an unresolved external symbol error relating to the
PerlIO_isutf8() function otherwise (which may be because that function
isn't documented in the perlapio manpage).]
--- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
+++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
@@ -1119,12 +1119,25 @@
SV *sv = sv_newmortal();
SV *rp = ST(0);
SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+ /*PerlIO *fp = PerlIO_stdout();*/
if(items > 2)
do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
else
sv_setsv(sv, ST(1));
+ /*if (PerlIO_isutf8(fp)) {
+ if (!SvUTF8(sv))
+ sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+ }
+ else*/ if (DO_UTF8(sv)) {
+ if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+ && ckWARN_d(WARN_UTF8))
+ {
+ Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+ }
+ }
+
PUSHMARK(sp);
XPUSHs(rp);
XPUSHs(sv);
Besides the problem with PerlIO_isutf8(), there are other problems that
spring to my mind straight away with this:
- is getting the PerlIO * for STDOUT the right thing to be doing anyway?
- if "items > 2", do we need to handle the UTF8-ness of each of those
items individually before we join them?
- we need to code this in such a way as to remain backwards compatible
with older Perls.
Anyway, it's a start.
> If not, we should log it in the STATUS file and hopefully someone will
> have the time and inclination to solve it.
Hopefully the above stab at it will encourage somebody to have a serious
look.
>
>
> In any case a simple test that reproduces the problem will be needed.
This test program reproduces the problem:
use 5.008;
use Encode;
print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
Use LWP's "get" program to get that from an Apache/mod_cgi setup, run it
through UNIX's "od -c" (get http://localhost/cgi-bin/test.pl | od -c)
and you get:
0000000 374
0000001
Try the same from an Apache/mod_perl setup and you get:
0000000 303 274
0000002
i.e. the double-byte UTF-8 character representing ü that has been output
is converted back to ü by Perl's print() [ü is character 252, octal
374], but is left as the two bytes by Apache's print().
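For anyone without od to hand, the same octal figures can be produced from Perl. The helper name octal_dump is just for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: render each byte of a string as three octal digits,
# in the style of "od -b".
sub octal_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%03o', ord } split //, $bytes;
}

my $latin1_dump = octal_dump("\xFC");      # ü as one ISO-8859-1 byte
my $utf8_dump   = octal_dump("\xC3\xBC");  # ü as two UTF-8 bytes

print "mod_cgi : $latin1_dump\n";
print "mod_perl: $utf8_dump\n";
```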
I've actually re-built my mod_perl using the half-formed patch given
above and it fixes this particular test case!
Steve
Re: Undocumented behaviour in Apache->print()?
Posted by Stas Bekman <st...@stason.org>.
Steve Hay wrote:
> Hi,
>
> I've just spent quite a while tracking down a problem with a web page
> generated by a mod_perl program in which 8-bit ISO-8859-1 characters
> were not being shown properly. The software runs via Apache::Registry,
> and works fine under mod_cgi.
>
> It turns out that the problem is due to a difference in behaviour
> between Perl's built-in print() function in Perl 5.8.0+ and the
> Apache->print() method that mod_perl overrides it with. I've consulted
> the documentation on the mod_perl website, and could find no mention of
> the difference. If my conclusions below are correct then this
> information may well be worth adding.
[the rest of the very detailed analysis has been snipped]
5.8.0 is a pretty new perl version, which provides the new functionality, and
it seems that hardly anybody has been using the UTF stuff with mod_perl. So I
suppose you are the first one to hit the problem. Certainly we need to update
mod_perl to handle this correctly. Would you be interested to try to make
Apache->print() do the right thing? If not, we should log it in the STATUS
file and hopefully someone will have the time and inclination to solve it.
In any case a simple test that reproduces the problem will be needed.