Posted to asp@perl.apache.org by Thanos Chatziathanassiou <tc...@arx.gr> on 2004/03/11 19:40:09 UTC

UTF-8 problems ....

I have an interesting problem with UTF-8 charset and Apache::ASP (possibly):

I had to construct some xml with Apache::ASP some time ago, but due to a 
switch, I now want the resulting file to be UTF-8 encoded instead of 
ISO-8859-7 that it was until now.
This is all fine, however the data I'm reading from the database are 
also in ISO-8859-7.
So I used Script_OnFlush like this:

$$ref =~ s|([\xB8-\xFE])|chr(ord($1)+0x02D0)|sge;
(BTW I also tried ``use Encode;'' and ``use encoding "iso-8859-7";'' of 
perl-5.8 with quite the same results)
in order to convert all greek characters from iso-8859-7 to utf-8 just 
before flushing the output to the client.
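For comparison, here is a minimal sketch of the same conversion done with the Encode module mentioned above. It is an illustration rather than the code that was actually tried: the helper name to_utf8_bytes and the sample string are made up for this example. Note that the substitution above maps each byte to a Unicode code point (a wide character), which is not yet a UTF-8 byte sequence, and that a few positions in the \xB8-\xFE range (0xBB, for instance) are not Greek letters in ISO-8859-7. Encode works from the full charset table and returns bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert a byte string in a legacy charset to UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes, $from) = @_;
    my $chars = decode($from, $bytes);   # bytes -> Perl character string
    return encode('UTF-8', $chars);      # characters -> UTF-8 encoded bytes
}

# alpha, beta, gamma in ISO-8859-7 are the three bytes E1 E2 E3.
my $greek_latin7 = "\xE1\xE2\xE3";
my $greek_utf8   = to_utf8_bytes($greek_latin7, 'iso-8859-7');

# Each of these Greek letters takes two bytes in UTF-8.
printf "%d bytes in, %d bytes out\n", length($greek_latin7), length($greek_utf8);
```

The point of returning bytes is that anything computed from the result afterwards, such as a length, refers to what is actually sent to the client.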

The problem is that although I can verify that $$ref contains what I 
want (I also printed it to a file just to make sure), the output to any 
client (Mozilla, Opera, IE, XML Spy, whatever) is truncated and 
obviously not valid.
I've actually used Ethereal to sniff the data on the network and they 
seem to be valid and the charset checks out ok.

Anyone have an idea what might be wrong ?


---------------------------------------------------------------------
To unsubscribe, e-mail: asp-unsubscribe@perl.apache.org
For additional commands, e-mail: asp-help@perl.apache.org


Re: UTF-8 problems ....

Posted by Josh Chamas <jo...@chamas.com>.
Warren Young wrote:
> Josh Chamas wrote:
> 
>>  PerlSetEnv LANG $LANG
>>
>> Does this resolve the UTF8 issues?
> 
> 
> It probably will.  I really like it that you can keep the same LANG 
> variable as the system uses.  I wouldn't like to have hard-coded it.
> 

I think if you have PerlPassEnv LANG, then you will merely pass what
is set at the system level, so you can avoid hard coding it generally.

When one develops with Oracle, one quickly finds out about this problem,
since Oracle clients require a host of %ENV settings in order to function
correctly, including a character set setting, but the standard one is
ORACLE_HOME.

Regards,

Josh
________________________________________________________________________
Josh Chamas, Founder    | NodeWorks - http://www.nodeworks.com
Chamas Enterprises Inc. | NodeWorks Directory - http://dir.nodeworks.com
http://www.chamas.com   | Apache::ASP - http://www.apache-asp.org





Re: UTF-8 problems ....

Posted by Warren Young <wa...@etr-usa.com>.
Josh Chamas wrote:

>  PerlSetEnv LANG $LANG
> 
> Does this resolve the UTF8 issues?

It probably will.  I really like it that you can keep the same LANG 
variable as the system uses.  I wouldn't like to have hard-coded it.

Right now, the stable version of my program has worked around this issue 
simply by building in understanding of where the conversions between 
UTF-8 and ISO 8859 occur.  In my development version, I had intended to 
try for keeping data UTF-8 through the entire pipeline, so I will try 
this.  Thanks.



Re: UTF-8 problems ....

Posted by Josh Chamas <jo...@chamas.com>.
Warren Young wrote:
> Thanos Chatziathanassiou wrote:
> 
>> I have an interesting problem with UTF-8 charset and Apache::ASP 
> 
> 
> The problem is that the ASP code runs with the LANG environment variable 
> unset.  (I'm not sure if it's Apache or mod_perl doing this.)  In this 
> situation, the Perl interpreter runs without Unicode support.  I posted 
> about a similar problem on the 25th of last month, which you may find 
> enlightening.
> 
> I'm not sure what the right solution to the problem is.  There are a 
> number of things that could be done.  In order of my preference:
> 
> 1. Find out who is unsetting LANG, and make 'em stop it.
> 

mod_perl notoriously does not set up %ENV correctly.  In order for
this to happen, one must use PerlPassEnv LANG, or use PerlSetEnv LANG $LANG
in the httpd.conf.

Does this resolve the UTF8 issues?

Regards,

Josh

________________________________________________________________________
Josh Chamas, Founder    | NodeWorks - http://www.nodeworks.com
Chamas Enterprises Inc. | NodeWorks Directory - http://dir.nodeworks.com
http://www.chamas.com   | Apache::ASP - http://www.apache-asp.org





Re: UTF-8 problems ....

Posted by Warren Young <wa...@etr-usa.com>.
Thanos Chatziathanassiou wrote:
> I have an interesting problem with UTF-8 charset and Apache::ASP 

The problem is that the ASP code runs with the LANG environment variable 
unset.  (I'm not sure if it's Apache or mod_perl doing this.)  In this 
situation, the Perl interpreter runs without Unicode support.  I posted 
about a similar problem on the 25th of last month, which you may find 
enlightening.

I'm not sure what the right solution to the problem is.  There are a 
number of things that could be done.  In order of my preference:

1. Find out who is unsetting LANG, and make 'em stop it.

2. Convert your database to UTF-8.  In this case, Perl will pass the 
data without change, since it doesn't change characters above 127 when 
running in Unicode-free mode.

3. Find another way besides the LANG variable to ask Perl to enable its 
Unicode support.  Perhaps there's a compile-time option?

4. Somehow force LANG to be set properly for your locale.  I'm not sure 
if you can set this early enough to make the Perl interpreter see it.

5. Do the charset conversion by hand.  The disadvantage to this approach 
is that it's easy to tie your code to one locale, making it nonportable.
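As a rough sketch of option 4, assuming a mod_perl startup.pl file is loaded before any ASP pages run: the locale name el_GR.ISO-8859-7 is only an example and has to exist on the system, and the explicit setlocale() call is an assumption about what is needed for the running interpreter to pick up the new value. Nothing here is confirmed Apache::ASP or mod_perl behaviour.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Assumption: el_GR.ISO-8859-7 is a locale name installed on this system.
# Only set LANG if the server environment did not provide one.
BEGIN {
    $ENV{LANG} = 'el_GR.ISO-8859-7'
        unless defined $ENV{LANG} && length $ENV{LANG};
}

# Ask the C library to re-read the locale from the environment.
# setlocale() returns undef when the requested locale is not available.
my $locale = setlocale(LC_CTYPE, '');
warn "locale for LANG=$ENV{LANG} is not available\n" unless defined $locale;
```

Keeping the locale name in one place like this at least avoids scattering it through individual pages.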



Re: UTF-8 problems ....

Posted by Josh Chamas <jo...@chamas.com>.
Thanos Chatziathanassiou wrote:
> Josh Chamas wrote:
> 
>> Maybe the Content-Length field is not calculated correctly by 
>> Apache::ASP?
>> If it's too short, this could be a problem.  To have Apache::ASP not 
>> calculate
>> the content length, try to flush when the script first starts, that 
>> will flush
>> the headers, and see if there is a difference.
>>
>> Regards,
>>
>> Josh
>>
> Sure enough, the Content-Length header was miscalculated. Since I had 
> the ethereal capture handy, I could verify it rather easily (although 
> had I been careful, I would have been able to figure it out myself - 
> thanks for pointing it out).
> A ``$Response->Flush();'' fixed things for me.
> Still, in the beginning I wasn't using Script_OnFlush, I used a regular 
> global.asa sub on the database data directly and I still got the same 
> problem. Then I switched to Script_OnFlush to save myself the trouble of 
> changing some 200+ files one by one.

Great.  The Content-Length header is calculated like this in Response.pm:

   $self->{headers_out}->set('Content-Length', length($$out));

As you can see, it does not do anything but use Perl's length() function.
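One possible explanation, not confirmed here, is that length() counts characters when the string holds decoded (wide) characters, while Content-Length must count bytes. A small sketch of the difference, using the Encode module and a made-up three-letter sample:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three Greek letters as a Perl character string (code points above 255).
my $chars = "\x{3B1}\x{3B2}\x{3B3}";

# What actually goes on the wire once the string is encoded as UTF-8.
my $bytes = encode('UTF-8', $chars);

my $char_count = length($chars);   # counts characters
my $byte_count = length($bytes);   # counts bytes, which Content-Length needs

print "characters: $char_count, bytes: $byte_count\n";
```

If the header were computed from the character count, a client would stop reading after that many bytes, which would look like truncated output.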

So I wonder if you are using a perl that is UTF8 aware, like the perl 5.8.x
series?  Otherwise, is there anything you can do to make it aware, like the
LANG ENV setting?

The only other thing I can think of is maybe we should not trust perl's
UTF8 handling generally for the length calculation, and if this is set:

   $Response->{ContentType} = 'text/html;charset=UTF-8'

then we simply do not calculate the Content-Length automatically, leaving
it as an exercise for the developer?
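To make that idea concrete, a sketch of what such a check could look like. The function name wants_auto_content_length is hypothetical and is not part of Apache::ASP; it only shows the decision, not how Response.pm would use it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Apache::ASP: decide whether an automatic
# Content-Length should be emitted for a given Content-Type value.
sub wants_auto_content_length {
    my ($content_type) = @_;
    return 1 unless defined $content_type;
    # Skip the automatic header when the declared charset is UTF-8.
    return $content_type =~ /charset\s*=\s*"?utf-?8/i ? 0 : 1;
}

my $utf8_page   = wants_auto_content_length('text/html;charset=UTF-8');
my $latin7_page = wants_auto_content_length('text/html; charset=ISO-8859-7');

print "UTF-8: $utf8_page, ISO-8859-7: $latin7_page\n";
```

The match is case-insensitive and tolerates optional whitespace and quoting around the charset value, since those vary in practice.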

Regards,

Josh

________________________________________________________________________
Josh Chamas, Founder    | NodeWorks - http://www.nodeworks.com
Chamas Enterprises Inc. | NodeWorks Directory - http://dir.nodeworks.com
http://www.chamas.com   | Apache::ASP - http://www.apache-asp.org





Re: UTF-8 problems ....

Posted by Thanos Chatziathanassiou <tc...@arx.gr>.
Josh Chamas wrote:

> Maybe the Content-Length field is not calculated correctly by 
> Apache::ASP?
> If it's too short, this could be a problem.  To have Apache::ASP not 
> calculate
> the content length, try to flush when the script first starts, that 
> will flush
> the headers, and see if there is a difference.
>
> Regards,
>
> Josh
>
Sure enough, the Content-Length header was miscalculated. Since I had
the ethereal capture handy, I could verify it rather easily (although
had I been careful, I would have been able to figure it out myself -
thanks for pointing it out).
A ``$Response->Flush();'' fixed things for me.
Still, in the beginning I wasn't using Script_OnFlush, I used a regular
global.asa sub on the database data directly and I still got the same
problem. Then I switched to Script_OnFlush to save myself the trouble of
changing some 200+ files one by one.
As far as the LANG is concerned, it has been correctly set to el_GR
(meaning ISO-8859-7) for as long as I can remember.
I recall having problems setting it, because I specifically created an
asp page telling whether a few greek words contain \w characters and to
convert them to upper case, just to see it works. I see now that I had
to set $ENV{'LANG'} in startup.pl to get it to work correctly.

Thanks again for the quick response.

Regards,
Thanos Chatziathanassiou



Re: UTF-8 problems ....

Posted by Josh Chamas <jo...@chamas.com>.
Thanos Chatziathanassiou wrote:
> I have an interesting problem with UTF-8 charset and Apache::ASP 
> (possibly):
> 
> I had to construct some xml with Apache::ASP some time ago, but due to a 
> switch, I now want the resulting file to be UTF-8 encoded instead of 
> ISO-8859-7 that it was until now.
> This is all fine, however the data I'm reading from the database are 
> also in ISO-8859-7.
> So I used Script_OnFlush like this:
> 
> $$ref =~ s|([\xB8-\xFE])|chr(ord($1)+0x02D0)|sge;
> (BTW I also tried ``use Encode;'' and ``use encoding "iso-8859-7";'' of 
> perl-5.8 with quite the same results)
> in order to convert all greek characters from iso-8859-7 to utf-8 just 
> before flushing the output to the client.
> 
> The problem is that although I can verify that $$ref contains what I 
> want (I also printed it to a file just to make sure), the output to any 
> client (Mozilla, Opera, IE, XML Spy, whatever) is truncated and 
> obviously not valid.
> I've actually used Ethereal to sniff the data on the network and they 
> seem to be valid and the charset checks out ok.
> 

Maybe the Content-Length field is not calculated correctly by Apache::ASP?
If it's too short, this could be a problem.  To have Apache::ASP not calculate
the content length, try to flush when the script first starts; that will flush
the headers.  See if there is a difference.

Regards,

Josh

________________________________________________________________________
Josh Chamas, Founder    | NodeWorks - http://www.nodeworks.com
Chamas Enterprises Inc. | NodeWorks Directory - http://dir.nodeworks.com
http://www.chamas.com   | Apache::ASP - http://www.apache-asp.org


