Posted to modperl@perl.apache.org by Drew Wilson <am...@apple.com> on 2007/03/16 20:14:41 UTC

UTF8 fun with SOAP::Lite and mod_perl 1.3.33

I'm trying to track down a Unicode mis-encoding problem using SOAP::Lite
0.67 with mod_perl 1.29 on Apache 1.3.33.

The problem I'm seeing is that my UTF-8 strings are altered in the HTTP
response.

The strings look correct inside Perl (e.g. when printed to STDERR from
the handler), but the bytes in the HTTP response (captured using
tcpdump) are different.

For example, if I want to send back a string containing the Unicode
snowman U+2603 (UTF-8 E2 98 83), I manually encode the string as:
            my $snowman = '☃';
            my %result = ( 'snowman' => SOAP::Data->type( string => $snowman ) );

and return it
            return %result;

When watching with tcpdump, I expect to see this UTF8 byte sequence:
	 e2 98 83
but instead see
	c3 a2 c2 98 c2 83

I suspect the UTF-8 byte sequence is being treated as a UTF-16
sequence [00 e2 00 98 00 83], i.e. each byte is taken as a separate
character, which is then converted to the UTF-8 equivalent byte
sequence [c3 a2 c2 98 c2 83].
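
The observed bytes can be reproduced outside the server. A minimal sketch, assuming the three UTF-8 bytes are read as three separate 8-bit characters (U+00E2, U+0098, U+0083, the ISO-8859-1 reading of those bytes) and then encoded as UTF-8 a second time. The helper names hex_bytes and double_encode are hypothetical and not part of SOAP::Lite or mod_perl:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hex dump of a string, one two-digit value per character, e.g. "e2 98 83".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Treat each input byte as one ISO-8859-1 character, then encode the
# resulting characters as UTF-8. This is the "double encoding" step.
sub double_encode {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);
    return encode('UTF-8', $chars);
}

my $snowman_bytes = encode('UTF-8', "\x{2603}");
print hex_bytes($snowman_bytes), "\n";                  # e2 98 83
print hex_bytes(double_encode($snowman_bytes)), "\n";   # c3 a2 c2 98 c2 83
```

The second line of output matches the tcpdump capture, which is consistent with the string being encoded twice somewhere between the handler and the socket. It does not show where that happens.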

But I cannot figure out WHERE this conversion is being done.

Is there any way to trace data being written to the response?

BTW - the $snowman string returns 1 for utf8::is_utf8 and utf8::valid.
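
A true result from utf8::is_utf8 only says that Perl's internal UTF8 flag is on; it does not say how many characters the string holds. One way to check what a string contains at each stage of the handler is to log its code points next to the bytes a UTF-8 encode would produce. A sketch, where describe_string is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Summarise a scalar: internal UTF8 flag, length in characters,
# code points, and the bytes a UTF-8 encode of it would produce.
sub describe_string {
    my ($s) = @_;
    my $flag   = utf8::is_utf8($s) ? 1 : 0;
    my $points = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
    my $wire   = join ' ', map { sprintf '%02x', ord($_) }
                 split //, encode('UTF-8', $s);
    return sprintf 'flag=%d chars=%d points=[%s] utf8=[%s]',
        $flag, length($s), $points, $wire;
}

# In a handler this could be: print STDERR describe_string($snowman), "\n";
print describe_string("\x{2603}"), "\n";       # one character
print describe_string("\xE2\x98\x83"), "\n";   # three 8-bit characters
```

A single snowman character reports chars=1 and utf8=[e2 98 83]. A string holding the three octets as separate characters reports chars=3 and utf8=[c3 a2 c2 98 c2 83], even though both may look the same when printed to a terminal.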

Thanks for any suggestions,

Drew

Re: UTF8 fun with SOAP::Lite and mod_perl 1.3.33

Posted by Drew Wilson <am...@apple.com>.
OK, I think I've got something that I'm happy with.

There were a couple problems with how my SOAP::Lite handler was  
processing UTF8.

First, SOAP::Lite encodes extended 8-bit strings as base64. What a
pain.  I found this workaround
<http://london.pm.org/pipermail/london.pm/Week-of-Mon-20061023/005171.html>.
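
To illustrate the base64 behaviour described above, here is a rough sketch of how an auto-typing check of that kind could behave. The regex and the names looks_binary and as_base64 are assumptions made for illustration; they are not copied from SOAP::Lite:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical stand-in for an auto-typing check: true when the value
# contains anything other than tab, LF, CR or printable ASCII.
sub looks_binary {
    my ($value) = @_;
    return $value =~ /[^\x09\x0a\x0d\x20-\x7e]/ ? 1 : 0;
}

# What a client would receive if the value were sent as base64.
sub as_base64 {
    my ($bytes) = @_;
    return encode_base64($bytes, '');   # '' suppresses the trailing newline
}

my $snowman_bytes = "\xE2\x98\x83";
print looks_binary('plain ASCII'), "\n";    # 0
print looks_binary($snowman_bytes), "\n";   # 1
print as_base64($snowman_bytes), "\n";      # 4piD
```

Under a check like this, any non-ASCII text is sent as opaque base64 unless the caller states the type explicitly.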

Second, I found this <http://www.perlmonks.org/?node_id=589132> which  
suggested running the string through utf8::decode and this seemed to  
solve my problems.

So my solution ends up looking something like this:
         # make sure Perl marks this as utf8
         utf8::decode($s);
         # set the type ourselves to prevent SOAP::Lite from
         # encoding it as base64
         push @results, SOAP::Data->type( string => $s );

It works for me.
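
A sketch of why utf8::decode can make the difference, assuming the string that reaches the serializer holds the three UTF-8 octets as three separate characters. wire_bytes and hex_of are hypothetical helpers, and Encode stands in for whatever encodes the response:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a string, one two-digit value per character.
sub hex_of {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

# Bytes a UTF-8 encode of $s would emit, optionally after running
# utf8::decode on a copy of the string first.
sub wire_bytes {
    my ($s, $decode_first) = @_;
    # utf8::decode works in place and returns false if the string is
    # not well-formed UTF-8.
    utf8::decode($s) if $decode_first;
    return hex_of(encode('UTF-8', $s));
}

my $raw = "\xE2\x98\x83";           # three octets, UTF8 flag off
my $upgraded = $raw;
utf8::upgrade($upgraded);           # same three characters, UTF8 flag on

print wire_bytes($raw, 0), "\n";        # c3 a2 c2 98 c2 83
print wire_bytes($raw, 1), "\n";        # e2 98 83
print wire_bytes($upgraded, 1), "\n";   # e2 98 83
```

The last case shows that a string can already report true for utf8::is_utf8 and still need utf8::decode, because the flag describes storage, not whether the octets have been turned into characters.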

Drew


On Mar 16, 2007, at 2:23 PM, Robert Landrum wrote:

> Drew Wilson wrote:
>> But I cannot figure out WHERE this conversion is being done.
>
> I've picked through SOAP::Lite enough to know that unicode  
> conversions are probably more than it knows how to handle.
>
> However, SOAP::Data::encode_data uses a regex to munge data.   
> Perhaps there's a conversion happening in the regex engine that  
> breaks the UTF8.
>
> Rob


Re: UTF8 fun with SOAP::Lite and mod_perl 1.3.33

Posted by Robert Landrum <rl...@aol.net>.
Drew Wilson wrote:
> But I cannot figure out WHERE this conversion is being done.
> 

I've picked through SOAP::Lite enough to know that unicode conversions 
are probably more than it knows how to handle.

However, SOAP::Data::encode_data uses a regex to munge data.  Perhaps 
there's a conversion happening in the regex engine that breaks the UTF8.
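
Whether or not the regex is responsible, Perl does perform an implicit conversion of this kind whenever a byte string is combined with a character string. A sketch of that behaviour in isolation; mix_and_encode and hex_of are hypothetical helpers, and this does not show that SOAP::Lite does the same thing:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a string, one two-digit value per character.
sub hex_of {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

# Join a byte string (UTF-8 octets, flag off) to a character string
# (flag on) and return the UTF-8 encoding of the result.
sub mix_and_encode {
    my ($octets, $chars) = @_;
    # Perl upgrades $octets as if each byte were an ISO-8859-1 character.
    my $mixed = $octets . $chars;
    return hex_of(encode('UTF-8', $mixed));
}

# Snowman octets joined to U+263A (a character string).
print mix_and_encode("\xE2\x98\x83", "\x{263A}"), "\n";
# c3 a2 c2 98 c2 83 e2 98 ba
```

The snowman octets come out double-encoded while the character that was already decoded comes out correctly, so mixing the two kinds of string in one buffer is enough to produce the pattern seen in tcpdump.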

Rob

Re: UTF8 fun with SOAP::Lite and mod_perl 1.3.33

Posted by Drew Wilson <am...@apple.com>.
FWIW - I did try forcing the page encoding, but this didn't turn out  
to be necessary as the XML is already utf-8.

Drew

On Mar 16, 2007, at 2:27 PM, Chris Jacobson wrote:

> FWIW, if you tell the client to render the page as UTF-8, your  
> 'broken' mod_perl version works correctly.  The content-type header  
> is instructing the client to render the page using ISO-8859-1,  
> which will result in gremlin characters being rendered.
>
> Aaron Hawryluk wrote:
>> This is suspiciously similar to the problem I had with double-byte  
>> characters coming up where single-byte characters were expected.   
>> If you find the answer to this, could you let me know?  I still  
>> can't migrate to mod_perl due to the problem. Mind you I'm on  
>> Apache2/mp2 so they could be completely unrelated...
>> Here's a sample of what happens:
>> Here it is under my old CGI model (which is now far too CPU- 
>> intensive):
>> http://www.calgarysun.com/cgi-bin/publish.cgi? 
>> p=171082&x=articles&s=showbiz
>> And here it is under mod_perl:
>> http://www.calgarysun.com/perl-bin/publish.cgi? 
>> p=171082&x=articles&s=showbiz
>> Hey! Mod_perl guys! Can you say "reproducibility"?
>> --Aaron Hawryluk
>> Webmaster, The Calgary Sun
>> http://www.calgarysun.com
>> webmaster@calgarysun.com
>> Ph: 403-250-4371
>
> -- 
> ____________________________________________________________________
> Chris Jacobson                         Phone: (513) 665-9070 x310
> Online-Rewards                         Fax  : (214) 242-4448
> 403 Vine Street, Second Floor          http://www.online-rewards.com
> Cincinnati, OH 45202
>


UTF-8 fun [was: UTF8 fun with SOAP::Lite and mod_perl 1.3.33]

Posted by Aaron Hawryluk <we...@calgarysun.com>.
If I instruct the browser to render as UTF-8, the strange characters disappear, but the proper characters don't show up. Instead I get the gap indicative of a non-rendering character, or nothing at all, depending on the browser (IE and FF do different things here - big surprise).

The problem as I see it is that the system locale is set to ISO-8859-1 (and MySQL should be using the system locale), Apache is set to ISO-8859-1, and yet for some reason UTF-8 (possibly - not necessarily - just double-byte instead of single-byte) is coming out of mod_perl where regular CGI is just pumping out (normal) ISO-8859-1.  Switching system locales might have some effect, so I'll test that on a development machine and see what happens.  Here goes nothing...
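
The "double-byte instead of single-byte" symptom is what accented ISO-8859-1 characters look like once they are encoded as UTF-8. A sketch of encoding output explicitly for the charset the response declares, rather than relying on the locale; bytes_for_charset and hex_of are hypothetical helpers:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a string, one two-digit value per character.
sub hex_of {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

# Encode character data for output in the charset the response declares.
sub bytes_for_charset {
    my ($chars, $charset) = @_;
    return encode($charset, $chars);
}

my $text = "caf\x{e9}";   # "cafe" with an acute accent on the e
print hex_of(bytes_for_charset($text, 'ISO-8859-1')), "\n";   # 63 61 66 e9
print hex_of(bytes_for_charset($text, 'UTF-8')), "\n";        # 63 61 66 c3 a9
```

If the page is declared as ISO-8859-1 but the second form is what gets sent, each accented character shows up as two characters in the browser.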

-----Original Message-----
From: Chris Jacobson [mailto:chris.jacobson@online-rewards.com] 
Sent: March-16-07 3:27 PM
To: Aaron Hawryluk
Cc: 'Drew Wilson'; 'modperl mod_perl'
Subject: Re: UTF8 fun with SOAP::Lite and mod_perl 1.3.33

FWIW, if you tell the client to render the page as UTF-8, your 'broken' 
mod_perl version works correctly.  The content-type header is 
instructing the client to render the page using ISO-8859-1, which will 
result in gremlin characters being rendered.

Aaron Hawryluk wrote:
> This is suspiciously similar to the problem I had with double-byte characters coming up where single-byte characters were expected.  If you find the answer to this, could you let me know?  I still can't migrate to mod_perl due to the problem. Mind you I'm on Apache2/mp2 so they could be completely unrelated...
> 
> Here's a sample of what happens:
> 
> Here it is under my old CGI model (which is now far too CPU-intensive):
> http://www.calgarysun.com/cgi-bin/publish.cgi?p=171082&x=articles&s=showbiz
> 
> And here it is under mod_perl:
> http://www.calgarysun.com/perl-bin/publish.cgi?p=171082&x=articles&s=showbiz
> 
> Hey! Mod_perl guys! Can you say "reproducibility"?
> 
> 
> --Aaron Hawryluk
> Webmaster, The Calgary Sun
> http://www.calgarysun.com
> webmaster@calgarysun.com
> Ph: 403-250-4371
> 
> 

-- 
____________________________________________________________________
Chris Jacobson                         Phone: (513) 665-9070 x310
Online-Rewards                         Fax  : (214) 242-4448
403 Vine Street, Second Floor          http://www.online-rewards.com
Cincinnati, OH 45202


Re: UTF8 fun with SOAP::Lite and mod_perl 1.3.33

Posted by Chris Jacobson <ch...@online-rewards.com>.
FWIW, if you tell the client to render the page as UTF-8, your 'broken' 
mod_perl version works correctly.  The content-type header is 
instructing the client to render the page using ISO-8859-1, which will 
result in gremlin characters being rendered.
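
The effect of the charset in the Content-Type header can be shown without a browser. A sketch that decodes the same bytes under two charsets and lists the code points a client would end up with; seen_as is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Code points a client sees when it interprets $bytes under $charset.
sub seen_as {
    my ($bytes, $charset) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) }
           split //, decode($charset, $bytes);
}

my $e_acute = "\xC3\xA9";   # UTF-8 bytes for U+00E9 (e with acute accent)
print seen_as($e_acute, 'UTF-8'), "\n";        # U+00E9
print seen_as($e_acute, 'ISO-8859-1'), "\n";   # U+00C3 U+00A9
```

Read as ISO-8859-1, the two bytes become two unrelated characters, which is the gremlin effect. Declaring the charset that matches the bytes actually sent, for example with $r->content_type('text/html; charset=UTF-8') in the handler, avoids the mismatch.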

Aaron Hawryluk wrote:
> This is suspiciously similar to the problem I had with double-byte characters coming up where single-byte characters were expected.  If you find the answer to this, could you let me know?  I still can't migrate to mod_perl due to the problem. Mind you I'm on Apache2/mp2 so they could be completely unrelated...
> 
> Here's a sample of what happens:
> 
> Here it is under my old CGI model (which is now far too CPU-intensive):
> http://www.calgarysun.com/cgi-bin/publish.cgi?p=171082&x=articles&s=showbiz
> 
> And here it is under mod_perl:
> http://www.calgarysun.com/perl-bin/publish.cgi?p=171082&x=articles&s=showbiz
> 
> Hey! Mod_perl guys! Can you say "reproducibility"?
> 
> 
> --Aaron Hawryluk
> Webmaster, The Calgary Sun
> http://www.calgarysun.com
> webmaster@calgarysun.com
> Ph: 403-250-4371
> 
> 

-- 
____________________________________________________________________
Chris Jacobson                         Phone: (513) 665-9070 x310
Online-Rewards                         Fax  : (214) 242-4448
403 Vine Street, Second Floor          http://www.online-rewards.com
Cincinnati, OH 45202


RE: UTF8 fun with SOAP::Lite and mod_perl 1.3.33

Posted by Aaron Hawryluk <we...@calgarysun.com>.
This is suspiciously similar to the problem I had with double-byte characters coming up where single-byte characters were expected.  If you find the answer to this, could you let me know?  I still can't migrate to mod_perl due to the problem. Mind you I'm on Apache2/mp2 so they could be completely unrelated...

Here's a sample of what happens:

Here it is under my old CGI model (which is now far too CPU-intensive):
http://www.calgarysun.com/cgi-bin/publish.cgi?p=171082&x=articles&s=showbiz

And here it is under mod_perl:
http://www.calgarysun.com/perl-bin/publish.cgi?p=171082&x=articles&s=showbiz

Hey! Mod_perl guys! Can you say "reproducibility"?


--Aaron Hawryluk
Webmaster, The Calgary Sun
http://www.calgarysun.com
webmaster@calgarysun.com
Ph: 403-250-4371

