Posted to modperl@perl.apache.org by Steve Hay <st...@uk.radan.com> on 2003/07/11 15:06:51 UTC

Undocumented behaviour in Apache->print()?

Hi,

I've just spent quite a while tracking down a problem with a web page 
generated by a mod_perl program in which 8-bit ISO-8859-1 characters 
were not being shown properly.  The software runs via Apache::Registry, 
and works fine under mod_cgi.

It turns out that the problem is due to a difference in behaviour 
between Perl's built-in print() function in Perl 5.8.0+ and the 
Apache->print() method that mod_perl overrides it with.  I've consulted 
the documentation on the mod_perl website, and could find no mention of 
the difference.  If my conclusions below are correct then this 
information may well be worth adding.

Under Perl 5.8.0, if a string stored in Perl's internal UTF-8 format is 
passed to print() then by default it will be converted to the machine's 
native 8-bit character set on output to STDOUT.  In my case, this is 
exactly as if I had called binmode(STDOUT, ':encoding(iso-8859-1)') 
before the print().  (If any characters in the UTF-8 string are not 
representable in ISO-8859-1 then a "Wide character in print()" warning 
will be emitted, and the bytes that make up that UTF-8 character will be 
output.)

However, mod_perl's Apache->print() method does not perform this 
automatic conversion.  It simply prints the bytes that make up each 
UTF-8 character (i.e. it outputs the UTF-8 string as UTF-8), exactly as 
if you had called binmode(STDOUT, ':utf8') before Apache->print().  (No 
"Wide character in print()" warnings are produced for characters with 
code points > 0xFF either.)
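
A minimal sketch of the two behaviours just described, assuming an in-memory filehandle stands in for STDOUT (the helper name bytes_written is hypothetical). It only illustrates what Perl's built-in print() does with and without a :utf8 layer; it does not exercise Apache->print() itself:

```perl
use strict;
use warnings;
use Encode ();

# Print $text to an in-memory filehandle carrying the given layer
# (empty string for none) and return the raw bytes that were written.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buf = '';
    open my $fh, '>', \$buf or die "open failed: $!";
    binmode $fh, $layer if length $layer;
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buf;
}

# A character string with the UTF-8 flag on, as in the test program.
my $str = Encode::decode('iso-8859-1', "Zur\xFCck");

my $plain = bytes_written($str, '');         # no layer: downgraded to one byte per character
my $utf8  = bytes_written($str, ':utf8');    # :utf8 layer: UTF-8 bytes written as-is

printf "no layer: %d bytes (%s)\n", length($plain), unpack('H*', $plain);
printf ":utf8:    %d bytes (%s)\n", length($utf8),  unpack('H*', $utf8);
```

The first line corresponds to what mod_cgi sends, the second to what was observed under mod_perl.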

The test program below illustrates this difference:

    use 5.008;
    use strict;
    use warnings;
    use Encode;

    my $cset = 'ISO-8859-1';
    #my $cset = 'UTF-8';

    print "Content-type: text/html; charset=$cset\n\n";
    print <<EOT;
    <html>
    <head>
    <meta http-equiv="Content-type" content="text/html; charset=$cset">
    </head>
    <body>
    EOT

    # $str is stored in Perl's internal UTF-8 format.
    my $str = Encode::decode('iso-8859-1', 'Zurück');
    print "<p>$str</p>\n";

    print <<EOT;
    </body>
    </html>
    EOT

Running under mod_cgi (using Perl's built-in print() function) the UTF-8 
encoded data in $str is converted to ISO-8859-1 on-the-fly by the 
print(), and the end-user will see the intended output when $cset is 
ISO-8859-1.  (Changing $cset to UTF-8 causes the ü to be replaced with ? 
in my web browser because the ü which is output is not a valid UTF-8 
character (which the output is labelled as).)

Running under mod_perl (with Perl's built-in print() function now 
overridden by the Apache->print() method) the UTF-8 encoded data in $str 
is NOT converted to ISO-8859-1 on-the-fly as it is printed, and the 
end-user will see the two bytes that make up the UTF-8 representation of 
ü when $cset is ISO-8859-1.  Changing $cset to UTF-8 in this case 
"fixes" it, because the output stream in this case happens to be valid 
UTF-8 all the way through.

There are two solutions to this:

1. To use $cset = 'ISO-8859-1': Explicitly convert the UTF-8 data in 
$str to ISO-8859-1 yourself before sending it to print(), rather than 
relying on print() to do that for you.  This is, in general, not 
possible (not all characters in the UTF-8 string may be representable in 
ISO-8859-1), but for HTML output we can arrange for Encode::encode to 
convert any non-representable characters to their HTML character references:

    $str = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);
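
As a small illustration of that call, here is a sketch assuming a sample string that contains one character (U+263A) with no ISO-8859-1 equivalent. The sample data and the LEAVE_SRC flag are my additions; LEAVE_SRC is only there so that the source string is left untouched:

```perl
use strict;
use warnings;
use Encode ();

# "Zurück" followed by U+263A, which ISO-8859-1 cannot represent.
my $str = "Zur\x{FC}ck \x{263A}";

# FB_HTMLCREF replaces unrepresentable characters with &#NNNN; references.
my $bytes = Encode::encode('iso-8859-1', $str,
                           Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

# Show the result with non-ASCII bytes made visible as \xHH.
(my $shown = $bytes) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
print "$shown\n";
```

The ü survives as the single byte 0xFC, while the unrepresentable character becomes a decimal character reference that a browser can still render.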

2. To use $cset = 'UTF-8': Output UTF-8 directly, ensuring that *all* 
outgoing data is UTF-8 by adding an appropriate layer on STDOUT:

    binmode STDOUT, ':utf8';

The second method here is generally to be preferred, but in the old 
software that I was experiencing problems with, I was not able to add 
the utf8 layer to STDOUT reliably (the data was being output from a 
multitude of print() statements scattered in various places), so I stuck 
with the first method.  I believed that it should work without the 
explicit encoding to ISO-8859-1 because I was unaware that mod_perl's 
print() override removed Perl's implicit encoding behaviour.  Actually, 
the explicit encoding above is better anyway because it also handles 
characters that can't be encoded to ISO-8859-1, but nevertheless I think 
the difference in mod_perl's print() is still worth mentioning in the 
documentation somewhere.

Cheers,

Steve

Re: Undocumented behaviour in Apache->print()?

Posted by Stas Bekman <st...@stason.org>.

Steve Hay wrote:

>> 5.8.0 is a pretty new perl version, which provides the new 
>> functionality, and it seems that hardly anybody has been using the UTF 
>> stuff with mod_perl.
> 
> 
> 5.8.0 is actually a couple of days short of being one year old (happy 
> birthday!), which is increasingly not that new any more.  5.8.1 should 
> be out soon too.

I meant that it was too new to be embraced by the crowd. It'll probably take a 
few more years before this happens. In any case, this is just an excuse ;)

> As for hardly anybody using UTF8 stuff with mod_perl... I didn't think 
> that I was until I realised that most XML parsers (certainly the two 
> that I use most -- XML::LibXML and XML::DOM) return all their data in 
> Perl's internal UTF-8 format!  Then the penny dropped that I was 
> actually using it rather a lot :-)

I thought XML was dead. Do people still use this archaic technology? I went to 
this session at this OS conference with many k00l ppls and there was this 
dude[1] who said that YAML is the future. Next they started talking about 
animals, and for some reason everybody liked ponie. Ah well, orange people 
[2], orange sites [3], orange ponies [4], jetlag, too many flights, too little 
sleep...

1: 
http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/ingy_nino_tired.jpg
2: 
http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/acme_perl_hacker_scary.jpg
3: http://search.cpan.org/
4: http://ponie.kwiki.org/ http://www.poniecode.org/

;)

__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:stas@stason.org http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com

Re: Undocumented behaviour in Apache->print()?

Posted by Steve Hay <st...@uk.radan.com>.

Steve Hay wrote:

> Stas Bekman wrote:
>
>
>> > I have attempted to shoe-horn this into mod_perl's print() method (in
>> > "src/modules/perl/Apache.xs").  Here's the diff against mod_perl 1.28:
>> > [Unfortunately, I've had to comment-out the first part of that "if"
>> > block, because I got an unresolved external symbol error relating to the
>> > PerlIO_isutf8() function otherwise (which may be because that function
>> > isn't documented in the perlapio manpage).]
>>
>> mod_perl 1.x doesn't use perlio, hence you have this problem. adding:
>>
>> #include "perlio.h"
>>
>> should resolve it I think. 
>
>
> No.  The error was "unresolved external symbol", which means that the 
> compiler is happy (it evidently has pulled in perlio.h, or something 
> else that declares PerlIO_isutf8() as "extern ..."), but that the 
> linker couldn't find the definition of that function.
>
> (Check: If I change "PerlIO_isutf8" to "PerlIO_isutf" (deliberate 
> typo) then I get a different error - "undefined; assuming extern 
> returning int" - because now no declaration has been supplied.)
>
> Listing the symbols exported from perl58.lib shows that PerlIO_isutf8 
> is *not* one of them.  So where's the definition supposed to come from?
>
> I'll ask about this on the perlxs mailing list, I think. 

Having asked about this, it turns out that the problem was 
PerlIO_isutf8() not being exported from perl58.lib on Windows (and other 
platforms where the symbols to export have to be explicitly listed).

I sent a patch off to p5p which fixes this 
(http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-07/msg01096.html), 
and it has been integrated as #20203.

So presumably this will not be a problem in the upcoming perl-5.8.1, but 
how do we cope with the perl-5.8.0 case?

I've attached a new patch (against mod_perl-1.28) which (I believe) 
fixes the UTF-8 problem, but it won't build on Windows with perl-5.8.0 
without #20203.

Steve

Re: Undocumented behaviour in Apache->print()?

Posted by Steve Hay <st...@uk.radan.com>.

Stas Bekman wrote:

> Steve Hay wrote:
>
>>
>> It's only Perl 5.8 that has the special "UTF-8 flag" which the 
>> functions above all operate with respect to.  If a Perl variable 
>> contains a sequence of bytes that make up a valid UTF-8 character, 
>> but the string is not flagged with Perl's special flag, then Perl's 
>> built-in print() doesn't do this automatic conversion anyway.
>
>
> Yes.
>
> Apps wanting to handle utf will need to 'require 5.008;' as in your 
> example.
>
>> IOW,
>>
>>    print "Content-type: text/plain\n\n";
>>    $a = "\xC3\xBC";
>>    print $a;
>>
>> retrieved from a mod_cgi server produces (via od -b / od -c):
>>
>>    0000000 303 274
>>    0000002
>
>
> yup, because you need to add utf8::decode($a); before printing $a. 
> Which your version does as well. 

(Indeed.  I meant it as example of how Perl's (5.8's) print() doesn't do 
the conversion on strings that are not *flagged* as UTF-8, even when 
they make valid UTF-8.)
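
A short sketch of that distinction, assuming nothing beyond the core utf8:: functions: the same two bytes are first an unflagged byte string and only become a flagged one-character string after utf8::decode().

```perl
use strict;
use warnings;

my $s = "\xC3\xBC";                          # two bytes, UTF-8 flag off

my $flag_before = utf8::is_utf8($s) ? 1 : 0;
my $len_before  = length $s;

utf8::decode($s);                            # now the single character U+00FC

my $flag_after = utf8::is_utf8($s) ? 1 : 0;
my $len_after  = length $s;

print "before: flag=$flag_before length=$len_before\n";
print "after:  flag=$flag_after length=$len_after\n";
```

Only the second form is subject to the automatic conversion in print().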

>
>
>> Perl 5.6 and older don't have the UTF-8 flag and hence don't do any 
>> automatic conversion via print().  Therefore, mod_perl's print() 
>> should not have the difference from Perl's print() that exists in 
>> 5.8, so no change should be required.
>>
>> Sure enough, looking at the "doio.c" source file in Perl 5.6.1, the 
>> entire chunk of code that I half-inched above is not present.
>
>
> So you suggest that we copy this functionality from Perl. So we need 
> to #ifdef it for 5.8.0. 

So I'll add

#if PERL_VERSION >= 8
...
#endif

around the code that I've added.

>
>
> > I have attempted to shoe-horn this into mod_perl's print() method (in
> > "src/modules/perl/Apache.xs").  Here's the diff against mod_perl 1.28:
> > [Unfortunately, I've had to comment-out the first part of that "if"
> > block, because I got an unresolved external symbol error relating to the
> > PerlIO_isutf8() function otherwise (which may be because that function
> > isn't documented in the perlapio manpage).]
> >
> > --- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
> > +++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
> > @@ -1119,12 +1119,25 @@
> >     SV *sv = sv_newmortal();
> >     SV *rp = ST(0);
> >     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> > +    /*PerlIO *fp = PerlIO_stdout();*/
> >
> >     if(items > 2)
> >         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
> >     else
> >         sv_setsv(sv, ST(1));
> >
> > +    /*if (PerlIO_isutf8(fp)) {
> > +        if (!SvUTF8(sv))
> > +        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> > +    }
> > +    else*/ if (DO_UTF8(sv)) {
> > +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> > +        && ckWARN_d(WARN_UTF8))
> > +        {
> > +        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> > +        }
> > +    }
> > +
> >     PUSHMARK(sp);
> >     XPUSHs(rp);
> >     XPUSHs(sv);
> >
> > Besides the problem with PerlIO_isutf8(),
>
> mod_perl 1.x doesn't use perlio, hence you have this problem. adding:
>
> #include "perlio.h"
>
> should resolve it I think. 

No.  The error was "unresolved external symbol", which means that the 
compiler is happy (it evidently has pulled in perlio.h, or something 
else that declares PerlIO_isutf8() as "extern ..."), but that the linker 
couldn't find the definition of that function.

(Check: If I change "PerlIO_isutf8" to "PerlIO_isutf" (deliberate typo) 
then I get a different error - "undefined; assuming extern returning 
int" - because now no declaration has been supplied.)

Listing the symbols exported from perl58.lib shows that PerlIO_isutf8 is 
*not* one of them.  So where's the definition supposed to come from?

I'll ask about this on the perlxs mailing list, I think.

>
>
> > there are other problems that
> > spring to my mind straight away with this:
> > - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>
> PerlIO *fp = IoOFP(GvIOp(defoutgv)) 

Seems to work OK for me.  What's defoutgv?

>
>
> > - if "items > 2", do we need to handle the UTF8-ness of each of those
> > items individually before we join them?
>
> I'm not sure, how perl handles this? 

Struggling as best as I can to read pp_print() in Perl's "pp_hot.c", it 
looks like Perl calls do_print() (which contains the UTF-8 handling that 
I've stolen) for each item in the list that is passed to it.

Considering this more, I think that it probably isn't an issue: if you 
have two variables in Perl, one of which is flagged UTF-8 and the other 
of which isn't, then when you concatenate them, the whole is "upgraded" 
to flagged UTF-8 anyway.
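
A quick way to check the concatenation claim from Perl itself (a sketch with made-up sample strings):

```perl
use strict;
use warnings;

my $bytes = "\xFC";        # unflagged: one byte
my $chars = "\x{263A}";    # flagged: one wide character

my $joined = $bytes . $chars;

my $joined_flagged = utf8::is_utf8($joined) ? 1 : 0;
my $bytes_flagged  = utf8::is_utf8($bytes)  ? 1 : 0;

printf "joined: flag=%d length=%d first=U+%04X\n",
    $joined_flagged, length($joined), ord(substr($joined, 0, 1));
printf "bytes:  flag=%d\n", $bytes_flagged;
```

The unflagged operand is itself left alone; its byte is treated as the character with the same code point in the upgraded result.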

However, it has occurred to me that I've missed out adding the UTF-8 
handling to half of mod_perl's print() method!

The method is split into two halves:

    if (!mod_perl_sent_header(r, 0)) {
    ...
    } else {
    ...
    }

and I've only handled the first half!

The first half joins all of the items together and then calls 
send_cgi_header().  That outputs everything down to the first blank line 
(i.e. all the headers), then sets the "sent headers" flag and recurses 
on $r->print().  Next time around, we'll enter the second half, which 
simply calls write_client().

If we've already been through the first half then the UTF-8 conversion 
will have been applied already, but if we come straight into the second 
half (i.e. by printing the headers and the body separately) then the 
UTF-8 conversion will not yet have been applied.  So as my patch stands,

    use utf8;
    $a = "\xC3\xBC";
    utf8::decode($a);
    print "Content-type: text/plain\n\n", $a;

will have the UTF-8 data in $a handled, but

    use utf8;
    $a = "\xC3\xBC";
    utf8::decode($a);
    print "Content-type: text/plain\n\n";
    print $a;

will not!

The write_client() method appears to call rwrite() (Apache's 
ap_rwrite()?) for each item in the list that is passed to it, so I 
suppose I should also add the UTF-8 handling code to each of those items 
too.  (This means that if the headers and body *are* printed together 
then the body will be UTF-8-handled twice -- once in the first half of 
print(), and then again in write_client().  However, that's "safe": the 
handling just ensures that the data is in the appropriate format.  It 
knows not to do anything if it is already in the correct format.)
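
The "safe to apply twice" point can be seen at the Perl level with utf8::downgrade(), which I am using here as a stand-in for the sv_utf8_downgrade() call in the patch (the helper name downgrade_for_output is hypothetical):

```perl
use strict;
use warnings;

# Return a copy of $text downgraded to one byte per character where possible.
sub downgrade_for_output {
    my ($text) = @_;
    my $copy = $text;
    # Second argument (FAIL_OK) makes a failed downgrade return false instead of dying.
    utf8::downgrade($copy, 1) or warn "Wide character in print\n";
    return $copy;
}

my $str = "\xC3\xBC";
utf8::decode($str);                          # flagged U+00FC

my $once  = downgrade_for_output($str);      # does the conversion
my $twice = downgrade_for_output($once);     # nothing left to do

printf "once=%s twice=%s same=%s\n",
    unpack('H*', $once), unpack('H*', $twice), ($once eq $twice ? 'yes' : 'no');
```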

I've attached a patch that incorporates these changes (with the 
PerlIO_isutf8() stuff still commented out until I figure out what to do 
about it).

Steve

Re: Undocumented behaviour in Apache->print()?

Posted by Stas Bekman <st...@stason.org>.

Steve Hay wrote:
> Stas Bekman wrote:
> 
>>> I have attempted to shoe-horn this into mod_perl's print() method (in 
>>> "src/modules/perl/Apache.xs").  Here's the diff against mod_perl 
>>> 1.28:  [Unfortunately, I've had to comment-out the first part of that 
>>> "if" block, because I got an unresolved external symbol error 
>>> relating to the PerlIO_isutf8() function otherwise (which may be 
>>> because that function isn't documented in the perlapio manpage).]
>>>
>>> --- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
>>> +++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
>>> @@ -1119,12 +1119,25 @@
>>>     SV *sv = sv_newmortal();
>>>     SV *rp = ST(0);
>>>     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>>> +    /*PerlIO *fp = PerlIO_stdout();*/
>>>
>>>     if(items > 2)
>>>         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>>>     else
>>>         sv_setsv(sv, ST(1));
>>>
>>> +    /*if (PerlIO_isutf8(fp)) {
>>> +        if (!SvUTF8(sv))
>>> +        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>>> +    }
>>> +    else*/ if (DO_UTF8(sv)) {
>>> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>>> +        && ckWARN_d(WARN_UTF8))
>>> +        {
>>> +        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>>> +        }
>>> +    }
>>> +
>>>     PUSHMARK(sp);
>>>     XPUSHs(rp);
>>>     XPUSHs(sv);
>>>
>>> Besides the problem with PerlIO_isutf8(), there are other problems 
>>> that spring to my mind straight away with this:
>>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>>> - if "items > 2", do we need to handle the UTF8-ness of each of those 
>>> items individually before we join them?
>>> - we need to code this in such a way as to remain backwards 
>>> compatible with older Perls.
>>
>>
>>
>> looks like this is the main question. Do we handle utf8 only for perl 
>> 5.8? 
> 
> 
> It's only Perl 5.8 that has the special "UTF-8 flag" which the functions 
> above all operate with respect to.  If a Perl variable contains a 
> sequence of bytes that make up a valid UTF-8 character, but the string 
> is not flagged with Perl's special flag, then Perl's built-in print() 
> doesn't do this automatic conversion anyway.

Yes.

Apps wanting to handle utf will need to 'require 5.008;' as in your example.

> IOW,
> 
>    print "Content-type: text/plain\n\n";
>    $a = "\xC3\xBC";
>    print $a;
> 
> retrieved from a mod_cgi server produces (via od -b / od -c):
> 
>    0000000 303 274
>    0000002

yup, because you need to add utf8::decode($a); before printing $a. Which your 
version does as well.

> Perl 5.6 and older don't have the UTF-8 flag and hence don't do any 
> automatic conversion via print().  Therefore, mod_perl's print() should 
> not have the difference from Perl's print() that exists in 5.8, so no 
> change should be required.
> 
> Sure enough, looking at the "doio.c" source file in Perl 5.6.1, the 
> entire chunk of code that I half-inched above is not present.

So you suggest that we copy this functionality from Perl. So we need to #ifdef 
it for 5.8.0.

 > I have attempted to shoe-horn this into mod_perl's print() method (in
 > "src/modules/perl/Apache.xs").  Here's the diff against mod_perl 1.28:
 > [Unfortunately, I've had to comment-out the first part of that "if"
 > block, because I got an unresolved external symbol error relating to the
 > PerlIO_isutf8() function otherwise (which may be because that function
 > isn't documented in the perlapio manpage).]
 >
 > --- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
 > +++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
 > @@ -1119,12 +1119,25 @@
 >     SV *sv = sv_newmortal();
 >     SV *rp = ST(0);
 >     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
 > +    /*PerlIO *fp = PerlIO_stdout();*/
 >
 >     if(items > 2)
 >         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
 >     else
 >         sv_setsv(sv, ST(1));
 >
 > +    /*if (PerlIO_isutf8(fp)) {
 > +        if (!SvUTF8(sv))
 > +        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
 > +    }
 > +    else*/ if (DO_UTF8(sv)) {
 > +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
 > +        && ckWARN_d(WARN_UTF8))
 > +        {
 > +        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
 > +        }
 > +    }
 > +
 >     PUSHMARK(sp);
 >     XPUSHs(rp);
 >     XPUSHs(sv);
 >
 > Besides the problem with PerlIO_isutf8(),

mod_perl 1.x doesn't use perlio, hence you have this problem. adding:

#include "perlio.h"

should resolve it I think.

 > there are other problems that
 > spring to my mind straight away with this:
 > - is getting the PerlIO * for STDOUT the right thing to be doing anyway?

PerlIO *fp = IoOFP(GvIOp(defoutgv))

 > - if "items > 2", do we need to handle the UTF8-ness of each of those
 > items individually before we join them?

I'm not sure, how perl handles this?


Re: Undocumented behaviour in Apache->print()?

Posted by Steve Hay <st...@uk.radan.com>.

Stas Bekman wrote:

>> I have attempted to shoe-horn this into mod_perl's print() method (in 
>> "src/modules/perl/Apache.xs").  Here's the diff against mod_perl 
>> 1.28:  [Unfortunately, I've had to comment-out the first part of that 
>> "if" block, because I got an unresolved external symbol error 
>> relating to the PerlIO_isutf8() function otherwise (which may be 
>> because that function isn't documented in the perlapio manpage).]
>>
>> --- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
>> +++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
>> @@ -1119,12 +1119,25 @@
>>     SV *sv = sv_newmortal();
>>     SV *rp = ST(0);
>>     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>> +    /*PerlIO *fp = PerlIO_stdout();*/
>>
>>     if(items > 2)
>>         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>>     else
>>         sv_setsv(sv, ST(1));
>>
>> +    /*if (PerlIO_isutf8(fp)) {
>> +        if (!SvUTF8(sv))
>> +        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>> +    }
>> +    else*/ if (DO_UTF8(sv)) {
>> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>> +        && ckWARN_d(WARN_UTF8))
>> +        {
>> +        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>> +        }
>> +    }
>> +
>>     PUSHMARK(sp);
>>     XPUSHs(rp);
>>     XPUSHs(sv);
>>
>> Besides the problem with PerlIO_isutf8(), there are other problems 
>> that spring to my mind straight away with this:
>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>> - if "items > 2", do we need to handle the UTF8-ness of each of those 
>> items individually before we join them?
>> - we need to code this in such a way as to remain backwards 
>> compatible with older Perls.
>
>
> looks like this is the main question. Do we handle utf8 only for perl 
> 5.8? 

It's only Perl 5.8 that has the special "UTF-8 flag" which the functions 
above all operate with respect to.  If a Perl variable contains a 
sequence of bytes that make up a valid UTF-8 character, but the string 
is not flagged with Perl's special flag, then Perl's built-in print() 
doesn't do this automatic conversion anyway.

IOW,

    print "Content-type: text/plain\n\n";
    $a = "\xC3\xBC";
    print $a;

retrieved from a mod_cgi server produces (via od -b / od -c):

    0000000 303 274
    0000002

Perl 5.6 and older don't have the UTF-8 flag and hence don't do any 
automatic conversion via print().  Therefore, mod_perl's print() should 
not have the difference from Perl's print() that exists in 5.8, so no 
change should be required.

Sure enough, looking at the "doio.c" source file in Perl 5.6.1, the 
entire chunk of code that I half-inched above is not present.

Steve

Re: Undocumented behaviour in Apache->print()?

Posted by Stas Bekman <st...@stason.org>.

[putting the test case on the top]

Steve Hay wrote:

 >> In any case a simple test that reproduces the problem will be needed.
 >
 >
 > This test program reproduces the problem:
 >
 >    use 5.008;
 >    use Encode;
 >    print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
 >
 > Use LWP's "get" program to get that from an Apache/mod_cgi setup, run it
 > through UNIX's "od -c" (get http://localhost/cgi-bin/test.pl | od -c)
 > and you get:
 >
 >    0000000 374
 >    0000001
 >
 > Try the same from an Apache/mod_perl setup and you get:
 >
 >    0000000 303 274
 >    0000002
 >
 > i.e. the double-byte UTF-8 character representing ü that has been output
 > is converted back to ü by Perl's print() [ü is character 252, octal
 > 374], but is left as the two bytes by Apache's print().
 >
 > I've actually re-built my mod_perl using the half-formed patch given
 > above and it fixes this particular test case!

On my Linux box it's 'od -b' that gives the octal values; 'od -c' prints the 
actual ASCII chars.

I've tested mp2 and it has the same problem. I've used a different version of 
your test:

#!/usr/bin/perl -w
use utf8;
print "Content-type: text/plain\n\n";
$a = "\xC3\xBC";
utf8::decode($a); print $a;

which gives the same char, as in:
% perl -le '$a = "\xC3\xBC"; use utf8; utf8::decode($a); print $a;'
ü

mod_perl 1.0 and 2.0 respond with:

GET 'http://localhost:8002/cgi-bin/test.pl' | od -b
0000000 303 274

and mod_cgi with
0000000 374


> Hmm.  We really need somebody who understands the internals of Perl and 
> mod_perl better than me, but here's a first stab at it:
> 
> The Perl source code contains a pp_print() function in "pp_hot.c" which 
> I presume is basically CORE::print().  It makes use of a do_print() 
> function.  I think that function comes from "doio.c", although it's 
> actually called Perl_do_print() there.  That function does some stuff 
> with the UTF-8 flag, which I guess is the sort of thing that we're 
> after.  Here's a chunk of Perl_do_print() from Perl 5.8.0:
> 
>    if (PerlIO_isutf8(fp)) {
>        if (!SvUTF8(sv))
>        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>    }
>    else if (DO_UTF8(sv)) {
>        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>        && ckWARN_d(WARN_UTF8))
>        {
>        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>        }
>    }
> 
> I think what this does is look to see if the fp (a PerlIO *) has the 
> ":utf8" encoding layer.  If so, then it upgrades the sv to UTF8 (which 
> is always possible).  If not, then it looks to see if the "bytes" pragma 
> is enabled.  If not, then it downgrades the sv from UTF8 (which is not 
> always possible -- if that fails and the UTF8 warnings category is 
> enabled then it outputs the good ol' "Wide character in print" warning).
> 
> I have attempted to shoe-horn this into mod_perl's print() method (in 
> "src/modules/perl/Apache.xs").  Here's the diff against mod_perl 1.28:  
> [Unfortunately, I've had to comment-out the first part of that "if" 
> block, because I got an unresolved external symbol error relating to the 
> PerlIO_isutf8() function otherwise (which may be because that function 
> isn't documented in the perlapio manpage).]
> 
> --- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
> +++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
> @@ -1119,12 +1119,25 @@
>     SV *sv = sv_newmortal();
>     SV *rp = ST(0);
>     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> +    /*PerlIO *fp = PerlIO_stdout();*/
> 
>     if(items > 2)
>         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>     else
>         sv_setsv(sv, ST(1));
> 
> +    /*if (PerlIO_isutf8(fp)) {
> +        if (!SvUTF8(sv))
> +        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> +    }
> +    else*/ if (DO_UTF8(sv)) {
> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> +        && ckWARN_d(WARN_UTF8))
> +        {
> +        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> +        }
> +    }
> +
>     PUSHMARK(sp);
>     XPUSHs(rp);
>     XPUSHs(sv);
> 
> Besides the problem with PerlIO_isutf8(), there are other problems that 
> spring to my mind straight away with this:
> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
> - if "items > 2", do we need to handle the UTF8-ness of each of those 
> items individually before we join them?
> - we need to code this in such a way as to remain backwards compatible 
> with older Perls.

looks like this is the main question. Do we handle utf8 only for perl 5.8?


Re: Undocumented behaviour in Apache->print()?

Posted by Steve Hay <st...@uk.radan.com>.

Hi Stas,

Stas Bekman wrote:

> Steve Hay wrote:
>
>> Hi,
>>
>> I've just spent quite a while tracking down a problem with a web page 
>> generated by a mod_perl program in which 8-bit ISO-8859-1 characters 
>> were not being shown properly.  The software runs via 
>> Apache::Registry, and works fine under mod_cgi.
>>
>> It turns out that the problem is due to a difference in behaviour 
>> between Perl's built-in print() function in Perl 5.8.0+ and the 
>> Apache->print() method that mod_perl overrides it with.  I've 
>> consulted the documentation on the mod_perl website, and could find 
>> no mention of the difference.  If my conclusions below are correct 
>> then this information may well be worth adding.
>
>
> [the rest of the very detailed analysis has been snipped]
>
> 5.8.0 is a pretty new perl version, which provides the new 
> functionality, and it seems that hardly anybody has been using the UTF 
> stuff with mod_perl.

5.8.0 is actually a couple of days short of being one year old (happy 
birthday!), which is increasingly not that new any more.  5.8.1 should 
be out soon too.

As for hardly anybody using UTF8 stuff with mod_perl... I didn't think 
that I was until I realised that most XML parsers (certainly the two 
that I use most -- XML::LibXML and XML::DOM) return all their data in 
Perl's internal UTF-8 format!  Then the penny dropped that I was 
actually using it rather a lot :-)

> So I suppose you are the first one to hit the problem. Certainly we 
> need to update mod_perl to handle this correctly. Would you be 
> interested to try to make Apache->print() do the right thing?

Hmm.  We really need somebody who understands the internals of Perl and 
mod_perl better than me, but here's a first stab at it:

The Perl source code contains a pp_print() function in "pp_hot.c" which 
I presume is basically CORE::print().  It makes use of a do_print() 
function.  I think that function comes from "doio.c", although it's 
actually called Perl_do_print() there.  That function does some stuff 
with the UTF-8 flag, which I guess is the sort of thing that we're 
after.  Here's a chunk of Perl_do_print() from Perl 5.8.0:

    if (PerlIO_isutf8(fp)) {
        if (!SvUTF8(sv))
        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
    }
    else if (DO_UTF8(sv)) {
        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
        && ckWARN_d(WARN_UTF8))
        {
        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
        }
    }

I think what this does is look to see if the fp (a PerlIO *) has the 
":utf8" encoding layer.  If so, then it upgrades the sv to UTF8 (which 
is always possible).  If not, then it looks to see if the "bytes" pragma 
is enabled.  If not, then it downgrades the sv from UTF8 (which is not 
always possible -- if that fails and the UTF8 warnings category is 
enabled then it outputs the good ol' "Wide character in print" warning).
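
To make that reading concrete, here is a rough Perl-level emulation of the same decision. It is only a sketch of my understanding, not the actual implementation: the function name bytes_for_output is hypothetical, $handle_is_utf8 stands in for PerlIO_isutf8(fp), and the "bytes" pragma check is ignored.

```perl
use strict;
use warnings;

# Return the bytes that would be written for $sv (a copy of the caller's string).
sub bytes_for_output {
    my ($sv, $handle_is_utf8) = @_;
    if ($handle_is_utf8) {
        # Handle has :utf8: every character goes out as UTF-8 (always possible).
        utf8::encode($sv);
    }
    elsif (utf8::is_utf8($sv)) {
        # No :utf8 layer: try to downgrade to one byte per character.
        unless (utf8::downgrade($sv, 1)) {
            warn "Wide character in print\n";
            utf8::encode($sv);    # downgrade failed: the UTF-8 bytes go out
        }
    }
    return $sv;
}

my $u = "\xFC";
utf8::upgrade($u);    # U+00FC with the UTF-8 flag on

my $to_plain = bytes_for_output($u, 0);
my $to_utf8  = bytes_for_output($u, 1);
my $wide     = do {
    local $SIG{__WARN__} = sub { };    # keep the demo quiet
    bytes_for_output("\x{263A}", 0);
};

printf "plain=%s utf8=%s wide=%s\n",
    map { unpack 'H*', $_ } $to_plain, $to_utf8, $wide;
```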

I have attempted to shoe-horn this into mod_perl's print() method (in 
"src/modules/perl/Apache.xs").  Here's the diff against mod_perl 1.28:  
[Unfortunately, I've had to comment-out the first part of that "if" 
block, because I got an unresolved external symbol error relating to the 
PerlIO_isutf8() function otherwise (which may be because that function 
isn't documented in the perlapio manpage).]

--- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
+++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
@@ -1119,12 +1119,25 @@
     SV *sv = sv_newmortal();
     SV *rp = ST(0);
     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+    /*PerlIO *fp = PerlIO_stdout();*/

     if(items > 2)
         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
     else
         sv_setsv(sv, ST(1));

+    /*if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else*/ if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+        && ckWARN_d(WARN_UTF8))
+        {
+        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+
     PUSHMARK(sp);
     XPUSHs(rp);
     XPUSHs(sv);

Besides the problem with PerlIO_isutf8(), there are other problems that 
spring to my mind straight away with this:
- is getting the PerlIO * for STDOUT the right thing to be doing anyway?
- if "items > 2", do we need to handle the UTF8-ness of each of those 
items individually before we join them?
- we need to code this in such a way as to remain backwards compatible 
with older Perls.

Anyway, it's a start.

> If not, we should log it in the STATUS file and hopefully someone will 
> have the time and inclination to solve it. 

Hopefully the above stab at it will encourage somebody to have a serious 
look.

>
>
> In any case a simple test that reproduces the problem will be needed. 

This test program reproduces the problem:

    use 5.008;
    use Encode;
    print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');

Use LWP's "get" program to get that from an Apache/mod_cgi setup, run it 
through UNIX's "od -c" (get http://localhost/cgi-bin/test.pl | od -c) 
and you get:

    0000000 374
    0000001

Try the same from an Apache/mod_perl setup and you get:

    0000000 303 274
    0000002

i.e. the double-byte UTF-8 character representing ü that has been output 
is converted back to ü by Perl's print() [ü is character 252, octal 
374], but is left as the two bytes by Apache's print().
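
The octal values can be double-checked without a web server. This sketch (the helper name od_b is hypothetical) encodes ü both ways and prints the bytes in octal, in the manner of od:

```perl
use strict;
use warnings;
use Encode ();

# Format a byte string as space-separated three-digit octal values.
sub od_b {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%03o', $_ } unpack 'C*', $bytes;
}

my $char = "\x{FC}";    # ü, code point 252

my $latin1 = od_b(Encode::encode('iso-8859-1', $char));
my $utf8   = od_b(Encode::encode('UTF-8',      $char));

print "ISO-8859-1: $latin1\n";
print "UTF-8:      $utf8\n";
```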

I've actually re-built my mod_perl using the half-formed patch given 
above and it fixes this particular test case!

Steve

Re: Undocumented behaviour in Apache->print()?

Posted by Stas Bekman <st...@stason.org>.

Steve Hay wrote:
> Hi,
> 
> I've just spent quite a while tracking down a problem with a web page 
> generated by a mod_perl program in which 8-bit ISO-8859-1 characters 
> were not being shown properly.  The software runs via Apache::Registry, 
> and works fine under mod_cgi.
> 
> It turns out that the problem is due to a difference in behaviour 
> between Perl's built-in print() function in Perl 5.8.0+ and the 
> Apache->print() method that mod_perl overrides it with.  I've consulted 
> the documentation on the mod_perl website, and could find no mention of 
> the difference.  If my conclusions below are correct then this 
> information may well be worth adding.

[the rest of the very detailed analysis has been snipped]

5.8.0 is a pretty new perl version, which provides the new functionality, and 
it seems that hardly anybody has been using the UTF stuff with mod_perl. So I 
suppose you are the first one to hit the problem. Certainly we need to update 
mod_perl to handle this correctly. Would you be interested to try to make 
Apache->print() do the right thing? If not, we should log it in the STATUS 
file and hopefully someone will have the time and inclination to solve it.

In any case a simple test that reproduces the problem will be needed.
