Posted to modperl@perl.apache.org by angie ahl <an...@gmail.com> on 2005/05/18 21:39:28 UTC

Baffling unicode weirdness

Hi List

I've been pottering away trying to get utf-8 behaving on my setup and
have nearly got there, but then the client phoned up saying that the £
symbol was being displayed as a ?.

The first page contains several languages and a £ sign and all is
displayed fine.

http://perl.wtsbroadcast.com/about/Angies_test_page.html


The second is the same as the first but without all the extra language
stuff. There the £ displays as a ?.

http://perl.wtsbroadcast.com/about/Angies_second_test_page.html


Very weird. Same code generated both pages. To explain the whole setup
would take a *long* time, but I wondered if anyone else had seen
this?

	use Encode qw(decode);
	use Encode::Guess;

	$v = $q->url_param('fieldname');

	# guess() returns a decoder object on success, or an error string on failure
	my $decoder = Encode::Guess->guess($v);
	my $decoded;
	if (ref($decoder)) {
		$decoded = $decoder->decode($v);
	}
	else {
		warn "Can't guess for $v: $decoder"; # trap error this way
		$decoded = decode("utf8", $v);
	}
	# keep the raw value if decoding produced nothing usable
	$params{$uarg} = $decoded ? $decoded : $v;


I just can't fathom this.


mod_perl 1/Apache 1 on Fedora Core 2.

Re: Baffling unicode weirdness

Posted by "Graeme St.Clair" <gr...@atlanticbb.net>.

I would throw the sterling sign out of the source document, and substitute 
&pound; or &#xa3; or &#163; (semi-colon is important!).  I think that would 
probably work across all platforms and browsers.

HTH, rgds, GStC.

Re: Baffling unicode weirdness

Posted by Markus Wichitill <ma...@gmx.de>.

angie ahl wrote:
> It looks as though the browser isn't sending the data as UTF-8 unless
> it contains text that has to be. As soon as I add a € or some other
> character that's utf-8 it comes through fine.

I've never seen any browser send anything but UTF-8 if the page was marked 
as UTF-8.

>>        my $decoder = Encode::Guess->guess($v);

Try to get rid of Encode::Guess, just use Encode::decode_utf8() or 
utf8::decode(). Guessing isn't necessary for normal browser input, and can 
only add problems by mis-guessing.
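
A minimal sketch of that suggestion, assuming the parameter value arrives
as raw octets. The helper name decode_utf8_strict is hypothetical;
Encode::decode_utf8 and the FB_CROAK and LEAVE_SRC check flags come from
the core Encode module. Returning undef on invalid input is just one
possible policy.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 octets into a character string.
# Returns undef if the input is undefined or is not well-formed UTF-8.
sub decode_utf8_strict {
    my ($octets) = @_;
    return undef unless defined $octets;

    # FB_CROAK makes decode_utf8 die on malformed input;
    # LEAVE_SRC keeps the caller's copy of the octets untouched.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $text  = eval { Encode::decode_utf8($octets, $check) };
    return $text;
}
```

With this approach a lone 0xA3 byte (a Latin-1 pound sign) is reported as
invalid instead of being silently guessed at.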

Dealing with UTF-8-flagged strings in Perl can be very complicated and 
error-prone, but in my experience, it works when handled correctly.

Re: Baffling unicode weirdness

Posted by Markus Wichitill <ma...@gmx.de>.

> Just for the record it was the browser passing the form params as
> Latin unless there was a character that couldn't be represented in
> Latin. Then it would do as it was told and pass it as utf-8

Can you show either the actual webpage with the form or a simplified test 
case of it? Because I'm still pretty sure browsers don't do that if the page 
is correct. Certainly not Firefox, which I also use and which behaves just 
fine in my own UTF-8 applications, even if I only submit ASCII and umlauts 
that could be represented in Latin1, but no characters > 256.

BTW, you can check what exactly Firefox submits by using the very useful 
LiveHTTPHeaders extension from http://livehttpheaders.mozdev.org. And/or you 
could check browsers against one of my UTF-8-capable applications at 
http://www.mwforum.org.
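
When reading the captured request, it may help to know what the pound
sign looks like on the wire in each encoding. The sketch below is only an
illustration: percent_encode_all is a hypothetical helper that
percent-encodes every octet, which is not exactly what a browser does for
ASCII characters, but the bytes for £ are the same.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text in $encoding and percent-encode
# every resulting octet, so the raw bytes are easy to compare.
sub percent_encode_all {
    my ($text, $encoding) = @_;
    my $octets = encode($encoding, $text);
    return join '', map { sprintf '%%%02X', ord($_) } split //, $octets;
}

print percent_encode_all("\x{A3}", 'UTF-8'), "\n";       # %C2%A3
print percent_encode_all("\x{A3}", 'iso-8859-1'), "\n";  # %A3
```

So a submitted value containing %C2%A3 was sent as UTF-8, while a bare
%A3 points to Latin-1 (or Windows-1252).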

Re: Baffling unicode weirdness

Posted by angie ahl <an...@gmail.com>.

On 5/19/05, Randy Kobes <ra...@theoryx5.uwinnipeg.ca> wrote:
> On Wed, 18 May 2005, Jay Savage wrote:
> 
> > On 5/18/05, angie ahl <an...@gmail.com> wrote:
> > > I can confirm that it's happening before the data's gone
> > > to the database or anything. I'm getting the params from
> > > CGI.pm and then decoding via decode("utf8", $v). The page
> > > the params came from is set as utf-8 in the http header
> > > and content type and firefox is believing the page is
> > > utf-8. It looks as though the browser isn't sending
> > > the data as UTF-8 unless it contains text that has to
> > > be. As soon as I add a € or some other character that's
> > > utf-8 it comes through fine. Checking the params before
> > > it's decoded showed the £ as I expected to see it after
> > > it had been decoded, leading me to think the form hasn't
> > > been passed as utf-8. Any clues..... anyone?
> 
> > That sounds about right.  Most (english) browsers default
> > to Latin-1 even when they say they don't.  Make sure you
> > have "enctype" set in the opening form tag. If it still
> > doesn't work, you'll need to figure out (or ask the client)
> > what the encoding is, and translate it by manipulating the
> > layers and/or encodings. But the bottom line is: if you're
> > not putting utf-8 in at some point, you won't get utf-8
> > out.
> 
> For
>  http://perl.wtsbroadcast.com/about/Angies_second_test_page.html
> if (in Firefox on Win32) I set
>    View -> Character Encoding -> Western (Windows-1252)
> I get the £ displayed.
> 
> --
> best regards,
> randy kobes
> 

Just for the record it was the browser passing the form params as
Latin unless there was a character that couldn't be represented in
Latin. Then it would do as it was told and pass it as utf-8

In the end I had to use Encode::Guess to see if it was utf-8; if so,
decode as that, otherwise decode as iso-8859-1.

To make it a tiny bit more stable, and after a lot of trial and error
I ended up doing this.

1. Concat all the values that were passed in the form into one string.
2. Run Encode::Guess on that in order to give it enough data to have a
   fair crack at it.
3. If $decoder is set, use it to decode the values; otherwise use iso-8859-1.
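
The steps above can be sketched roughly as follows. This is not the
original code: decode_form_values is a hypothetical helper, and it assumes
the raw form values have already been collected into a hash of octet
strings. It relies on Encode::Guess's default suspects (ascii and utf8),
so Latin-1 input makes the guess fail and triggers the fallback.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # default suspects: ascii and utf8

# Hypothetical helper: takes a hashref of raw (octet) form values and
# returns a hashref of decoded character strings.
sub decode_form_values {
    my ($raw) = @_;

    # Steps 1 and 2: concatenate all values and guess once on the lot.
    my $sample  = join '', grep { defined($_) && length($_) } values %$raw;
    my $decoder = length($sample) ? Encode::Guess->guess($sample) : undef;

    # Step 3: guess() returns an object on success and an error string
    # on failure, so ref() tells the two cases apart.
    my %decoded;
    for my $name (keys %$raw) {
        my $octets = $raw->{$name};
        next unless defined $octets;
        $decoded{$name} = ref($decoder)
            ? $decoder->decode($octets)
            : decode('iso-8859-1', $octets);
    }
    return \%decoded;
}
```

A single guess over the combined values means all fields are decoded the
same way, which matches the assumption that the browser picks one encoding
per submission.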

Not very pretty I grant you, but the only thing that does actually
work seeing as the browser won't pass values as utf-8 all the time. Or
maybe it's the OS that's entering the text as iso-8859-1.

HTH someone someday.

Re: Baffling unicode weirdness

Posted by Randy Kobes <ra...@theoryx5.uwinnipeg.ca>.

On Wed, 18 May 2005, Jay Savage wrote:

> On 5/18/05, angie ahl <an...@gmail.com> wrote:
> > I can confirm that it's happening before the data's gone
> > to the database or anything. I'm getting the params from
> > CGI.pm and then decoding via decode("utf8", $v). The page
> > the params came from is set as utf-8 in the http header
> > and content type and firefox is believing the page is
> > utf-8. It looks as though the browser isn't sending
> > the data as UTF-8 unless it contains text that has to
> > be. As soon as I add a € or some other character that's
> > utf-8 it comes through fine. Checking the params before
> > it's decoded showed the £ as I expected to see it after
> > it had been decoded, leading me to think the form hasn't
> > been passed as utf-8. Any clues..... anyone?

> That sounds about right.  Most (english) browsers default
> to Latin-1 even when they say they don't.  Make sure you
> have "enctype" set in the opening form tag. If it still
> doesn't work, you'll need to figure out (or ask the client)
> what the encoding is, and translate it by manipulating the
> layers and/or encodings. But the bottom line is: if you're
> not putting utf-8 in at some point, you won't get utf-8
> out.

For
 http://perl.wtsbroadcast.com/about/Angies_second_test_page.html
if (in Firefox on Win32) I set
   View -> Character Encoding -> Western (Windows-1252)
I get the £ displayed.
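
One plausible reading of this observation is that the page contains a
single 0xA3 byte for the pound sign while being labelled as UTF-8. The
sketch below only illustrates how that byte decodes under each encoding;
first_codepoint is a hypothetical helper, and the actual bytes in the
page were not inspected here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $octets as $encoding and return the code
# point of the first resulting character.
sub first_codepoint {
    my ($encoding, $octets) = @_;
    my $copy = $octets;    # work on a copy of the input
    return ord(decode($encoding, $copy));
}

# A lone 0xA3 is malformed UTF-8, so it becomes U+FFFD (replacement char).
printf "UTF-8:  U+%04X\n", first_codepoint('UTF-8',  "\xA3");
# In Windows-1252 the same byte is U+00A3, the pound sign.
printf "cp1252: U+%04X\n", first_codepoint('cp1252', "\xA3");
```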

-- 
best regards,
randy kobes

Re: Baffling unicode weirdness

Posted by Jay Savage <da...@gmail.com>.

On 5/18/05, angie ahl <an...@gmail.com> wrote:
> I can confirm that it's happening before the data's gone to the
> database or anything.
> 
> I'm getting the params from CGI.pm and then decoding via decode("utf8", $v)
> 
> The page the params came from is set as utf-8 in the http header and
> content type and firefox is believing the page is utf-8.
> 
> It looks as though the browser isn't sending the data as UTF-8 unless
> it contains text that has to be. As soon as I add a € or some other
> character that's utf-8 it comes through fine.
> 
> Checking the params before it's decoded showed the £ as I expected to
> see it after it had been decoded, leading me to think the form hasn't
> been passed as utf-8.
> 
> Any clues..... anyone?
> 

That sounds about right.  Most (English) browsers default to Latin-1
even when they say they don't.  Make sure you have "enctype" set in
the opening form tag. If it still doesn't work, you'll need to figure
out (or ask the client) what the encoding is, and translate it by
manipulating the layers and/or encodings.

But the bottom line is: if you're not putting utf-8 in at some point,
you won't get utf-8 out.

--jay

Re: Baffling unicode weirdness

Posted by angie ahl <an...@gmail.com>.

I can confirm that it's happening before the data's gone to the
database or anything.

I'm getting the params from CGI.pm and then decoding via decode("utf8", $v)

The page the params came from is set as utf-8 in the http header and
content type and firefox is believing the page is utf-8.

It looks as though the browser isn't sending the data as UTF-8 unless
it contains text that has to be. As soon as I add a € or some other
character that's utf-8 it comes through fine.

Checking the params before it's decoded showed the £ as I expected to
see it after it had been decoded, leading me to think the form hasn't
been passed as utf-8.

Any clues..... anyone?

