Posted to embperl@perl.apache.org by Jean-Christophe Boggio <em...@thefreecat.org> on 2010/04/21 17:20:46 UTC

Encoding problem

Hello,

I have problems with the encoding of posted form data. I try to do everything
in UTF-8 (code, DB, HTML...).

I have a form on a page where the data IS UTF-8 (as far as I can tell), but
the values do not have Perl's UTF-8 flag set, and I wonder why.

Firefox detects the page encoding as Unicode (UTF-8). The page has this header :
<meta http-equiv="content-type" content="text/html; charset=utf-8">

But if I "print OUT $fdat{myfield}", the value gets encoded to UTF-8 a second
time (i.e. I get two characters like Äç for every accented letter).
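
A minimal sketch of how this kind of doubling can arise, assuming (this is not confirmed for Embperl anywhere in this thread) that the form value reaches Perl as raw UTF-8 bytes and is then encoded to UTF-8 once more on output. The variable names are hypothetical; only core Encode functions are used.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# UTF-8 bytes of "é" with the UTF-8 flag off: Perl sees 2 separate characters.
my $bytes = "\xC3\xA9";

# Encoding the undecoded bytes treats each byte as a Latin-1 character
# and encodes it again, so every accented letter doubles in size.
my $double = encode('UTF-8', $bytes);

# Decoding first yields 1 character (U+00E9), which encodes back to 2 bytes.
my $chars = decode('UTF-8', $bytes);
my $once  = encode('UTF-8', $chars);

printf "double: %d bytes, once: %d bytes\n", length $double, length $once;
# prints: double: 4 bytes, once: 2 bytes
```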

The following code makes the page work but I don't understand why I have to
do the work manually :

foreach my $k(keys %fdat) {
	Encode::_utf8_on($fdat{$k});
}
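
Encode::_utf8_on() only flips the internal flag and does not check that the bytes are valid UTF-8. A more defensive variant, sketched here with a plain hash standing in for Embperl's %fdat (an assumption for illustration, not tested under Embperl), is to decode each value explicitly:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for %fdat: raw UTF-8 bytes as posted by the browser.
my %form = (nom => "\xC3\xA9t\xC3\xA9");   # the bytes of "été"

for my $key (keys %form) {
    # Copy first: decode() with a CHECK argument may modify its source string.
    my $bytes = $form{$key};
    # FB_CROAK makes malformed input die instead of slipping through.
    $form{$key} = decode('UTF-8', $bytes, Encode::FB_CROAK());
}

print length($form{nom}), " characters\n";   # prints: 3 characters

# A lone Latin-1 byte is not valid UTF-8, so decoding it fails.
my $bad = "\xE9";
my $ok  = eval { decode('UTF-8', $bad, Encode::FB_CROAK()); 1 };
print "invalid input rejected\n" unless $ok;
```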

My apache2 conf is like this :

         <Directory /var/www/sites/dynatouraine>
                 Options Indexes FollowSymLinks MultiViews
                 AllowOverride None
                 Order allow,deny
                 allow from all
                 EMBPERL_APPNAME         DynaTouraine
                 EMBPERL_OBJECT_BASE     base.epl
                 EMBPERL_ESCMODE         0
                 <Files *.html>
                         SetHandler      perl-script
                         PerlHandler     Embperl::Object
                         Options         ExecCGI
                 </Files>
         </Directory>

Thanks for your help.

PS: Embperl 2.2.0-3.1 on Debian/Lenny 5.0.4 with apache 2.2.9-10+lenny6

-- 
Jean-Christophe Boggio                       -o)
embperl@thefreecat.org                       /\\
Independant Consultant and Developer        _\_V

---------------------------------------------------------------------
To unsubscribe, e-mail: embperl-unsubscribe@perl.apache.org
For additional commands, e-mail: embperl-help@perl.apache.org


RE: Encoding problem

Posted by Gerald Richter - ECOS <ge...@ecos.de>.
Hi,

I have UTF-8 pages where, as far as I remember, things are handled correctly, but I have to dig a little deeper to find out how they differ from your example. I am currently on a business trip and hope to find time to look at it near the end of the week.

Gerald






Re: Encoding problem

Posted by Jean-Christophe Boggio <em...@thefreecat.org>.
Hi,

Since I seem to be the only one having problems with UTF-8 forms, I guess
the problem is that my expectations are wrong.

The following is a simple HTML test page with a simple form. I expect the
result to be UTF-8, but it is not (until I uncomment the Encode::_utf8_on() line).

Is this normal? Do you see the same behaviour? Can someone explain (or point me
to a doc explaining) what I am confusing?

Thanks for your help,



<!doctype html>
<html><head>
[-
	use utf8;
	use Encode;
	
# Encode::_utf8_on($fdat{$_}) for keys %fdat;

$escmode=0;
$http_headers_out{'Content-Type'}="text/html; charset=utf-8";
-]
</head>

<body>
	[+ utf8::is_utf8($fdat{nom}) ? 'utf8' : 'other' +]
	<br />
	Received : [+ $fdat{nom} +]<br />
	<form method="post" accept-charset="UTF-8">
		<input type="text" id="nom" name="nom" />
		<input type="submit" value="go" />
	</form>
</body>

</html>


PS: In the same directory I have a base.epl file containing
[- Execute('*'); -]


-- 
Jean-Christophe Boggio                       -o)
embperl@thefreecat.org                       /\\
Independant Consultant and Developer        _\_V



Re: Encoding problem

Posted by Alexander Hartmaier <al...@t-systems.at>.
Sorry, seems I missed the point.

I'm using plain old Embperl, not Embperl::Object so it might be a
difference in there.

My old Embperl app works flawlessly with UTF-8.
Have you checked if your browser sends the data as UTF-8 with e.g.
tcpdump?
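
Short of a packet capture, a quick server-side check is to dump the bytes of the received value in hex. This sketch uses a hard-coded string in place of a real form value (an assumption for illustration):

```perl
use strict;
use warnings;

# Hypothetical received value: "été" as it would arrive if sent as UTF-8.
my $value = "\xC3\xA9t\xC3\xA9";

# One two-digit hex number per byte of the string.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $value;

print "$hex\n";   # prints: c3 a9 74 c3 a9
```

If the browser had sent ISO-8859-1 instead, the same word would show up as "e9 74 e9".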

In general you shouldn't rely on Perl's utf-8 flag but en-/decode
according to the charset you expect.
I'm not sure if Embperl decodes request params by default into Perl's
internal utf-8 representation.

My app does successfully store German Umlauts into our Oracle database,
but I haven't checked Perl's internal utf-8 flag.
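
A small sketch of why the flag by itself says little about the content: the two strings below hold the same characters and compare equal, yet only one of them has the flag set. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";   # 4 characters, stored one byte each, flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);   # same 4 characters, stored as UTF-8 internally, flag on

printf "equal: %s\n", $plain eq $upgraded ? 'yes' : 'no';
printf "flags: %s / %s\n",
    utf8::is_utf8($plain)    ? 'on' : 'off',
    utf8::is_utf8($upgraded) ? 'on' : 'off';
# prints:
#   equal: yes
#   flags: off / on
```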

--
Best regards, Alex




Re: Encoding problem

Posted by André Warnier <aw...@ice-sa.com>.
Emmanuel CROMBEZ wrote:
> Hello
> 
> My solution to UTF-8 problems in web pages has 3 steps:
> 1 - set the content type
> 2 - set the XML encoding
> 3 - set the encoding in the HTML header
> 
> In mod_perl , use
>   
>   $r->content_type('text/html; Charset=UTF-8');
> 
> The first line of your html page must be :
> 
>   <?xml version="1.0" encoding="UTF-8" ?>
> 
> And in the <head> you must have :
> 
>   <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
> 
The above is all good, but you should also add what kolikov wrote:
  <form action="xx.html"  method="post" accept-charset="UTF-8">

... but this is still not 100% foolproof, unfortunately.
Several aspects can still cause problems:

1) according to the HTTP specification, the request URL (of which the 
query string is a part, for a GET) does not have any particular 
encoding. That means that decoding the query string properly 
requires some kind of agreement between the client and the server.

2) in the <form> tag above, the method is POST.  That means that the 
data will arrive in the body of the request.  There are 2 ways of 
encoding (or rather, presenting) POST data:
- application/x-www-form-urlencoded (the default)
- multipart/form-data
To specify which one the browser should use, add an enctype 
attribute to the <form> tag, e.g.:
  <form action="xx.html"  method="post" accept-charset="UTF-8" 
enctype="multipart/form-data">
Theoretically, in the multipart/form-data format, each form parameter is 
submitted in a separate "section" of the data, a bit like an email with 
attachments. And each part should have a Content-Type header, with a 
charset.  Unfortunately, the last time I looked, browsers did not 
specify the charset for form parameters.  (That is a real pity, because 
that would be the right solution.)

3) Finally, no matter what you do on the server side, ultimately you are 
sending this to a browser on the client side.  And the ultimate master 
of the browser is the user who sits in front of it.  If the user wants 
to change the browser settings (including the charset of your page), he can.
The user can also be using a bad browser (one that decides by itself how your 
page should be interpreted), or a program that is not a browser, but 
just simulates one (think curl, wget, lwp-request).

An example: on the server side, save an HTML page with MS Notepad 
as UTF-8.  Notepad then automatically adds a "BOM" at the beginning of 
the file.  Now send this page to IE.  It does not matter which charset 
you set in the HTTP headers, or in the page's <meta> tags, IE will look 
at the BOM and decide that this is UTF-8. Always.

(IE also has a setting : "send all URLs as UTF-8").
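
To illustrate point 1 above: percent-decoding a query string only gives back bytes, and the charset has to be applied as a separate, agreed-upon step. A minimal sketch, assuming client and server have agreed on UTF-8; the query string and the helper name percent_decode are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn '+' and %XX escapes back into raw bytes (no charset is assumed here).
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $s;
}

# Hypothetical query string, as a browser would send "été" in UTF-8.
my ($name, $raw) = split /=/, 'nom=%C3%A9t%C3%A9', 2;

my $bytes = percent_decode($raw);       # 5 raw bytes
my $text  = decode('UTF-8', $bytes);    # 3 characters, by agreement on UTF-8

print length($bytes), " bytes, ", length($text), " characters\n";
# prints: 5 bytes, 3 characters
```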

So I add yet another gimmick to my form pages : a hidden field 
containing a known "accented" character sequence.  Then when the 
parameters of the form are posted to the server, the perl code on the 
server side checks the length (in bytes) of this parameter. If it 
matches the expected byte length and value of the hidden field, then 
chances are that everything is OK. If not, something funny happened.
Of course, a user really out to get you can also save the form, edit the 
hidden field, and submit the modified form to your server.
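
The hidden-field check described above could look roughly like this. It is only a sketch: the field name charset_probe, the probe value "éà" and the helper guess_form_charset are all hypothetical, and it assumes the server sees the posted value as raw, undecoded bytes.

```perl
use strict;
use warnings;

# Hypothetical hidden field in the form:
#   <input type="hidden" name="charset_probe" value="éà">
my %PROBE_BYTES = (
    'UTF-8'      => "\xC3\xA9\xC3\xA0",   # "éà" encoded as UTF-8 (4 bytes)
    'ISO-8859-1' => "\xE9\xE0",           # "éà" encoded as Latin-1 (2 bytes)
);

# Return the charset whose encoding of the probe matches the received bytes,
# or undef if nothing matches (something funny happened).
sub guess_form_charset {
    my ($received) = @_;
    return undef unless defined $received;
    for my $charset (sort keys %PROBE_BYTES) {
        return $charset if $received eq $PROBE_BYTES{$charset};
    }
    return undef;
}

print guess_form_charset("\xC3\xA9\xC3\xA0"), "\n";   # prints: UTF-8
print guess_form_charset("\xE9\xE0"), "\n";           # prints: ISO-8859-1
```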

Everything that can happen, will happen at some time.  It is just a 
matter of how much the incentive is to do it.

















Re: Encoding problem

Posted by Emmanuel CROMBEZ <ec...@lanthrasites.com>.
Hello

My solution to UTF-8 problems in web pages has 3 steps:
1 - set the content type
2 - set the XML encoding
3 - set the encoding in the HTML header

In mod_perl, use
  
  $r->content_type('text/html; Charset=UTF-8');

The first line of your HTML page must be:

  <?xml version="1.0" encoding="UTF-8" ?>

And in the <head> you must have:

  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

If you don't have these 3 lines, some parts of your page work and others
don't. For example, if you don't set the <?xml> declaration, the title of your page
doesn't always display correctly.

I wrote an article about this on my blog (in French) here: 
http://ecrombez.lantrasite.com/index.pl?PAGE=53&ID_BILLET=115




Re: Encoding problem

Posted by kolikov <ko...@free.fr>.
Jean-Christophe Boggio wrote:
> ? The problem comes from the header I *receive*. The headers I send are
> always good (hard coded in base.epl). I'm quoting myself :
>
>> <meta http-equiv="content-type" content="text/html; charset=utf-8">

In case it helps:

All my scripts are written in UTF-8 encoding.
My default system/database locales are UTF-8.
My apache2.conf is the default one.

# cat /etc/apache2/sites-available/mysite
<Directory /var/www/mysite>
  AddDefaultCharset utf-8
ETC ....
</Directory>

My HTML headers are the same as yours,
but I put this on every <form>:

<form action="xx.html"  method="post" accept-charset="UTF-8">

which may be what makes the difference...

Best regards,
Romu.

-- 
Nuguet romuald : kolikov@free.fr



Re: Encoding problem

Posted by Jean-Christophe Boggio <em...@thefreecat.org>.
Hi Alexander,

Alexander Hartmaier wrote:
> You should *always* return the correct charset in the http header, no
> matter which framework/cgi script you're using.

? The problem comes from the header I *receive*. The headers I send are
always good (hard coded in base.epl). I'm quoting myself :

> Firefox detects the page encoding as Unicode (UTF-8). The page has this header :
> <meta http-equiv="content-type" content="text/html; charset=utf-8">

Are you suggesting something else? I'm sorry, I don't understand your point.

I referred to other people having the same kind of problem only because
it might not be an Embperl-only problem but maybe an Apache/Perl
problem.

In short, the %fdat "fields" are encoded in UTF-8 but not seen
by Perl *as* UTF-8.

> AddDefaultCharset in apache is bad because it appends that header for
> every resource which didn't specify it.

I know, it was a "second chance" type of solution (suggested by Gerald).
But I don't want anything other than UTF-8, so it does no harm.

Any ideas?

Thanks for your help,

-- 
Jean-Christophe Boggio                       -o)
embperl@thefreecat.org                       /\\
Independant Consultant and Developer        _\_V



Re: Encoding problem

Posted by Alexander Hartmaier <al...@t-systems.at>.
You should *always* return the correct charset in the http header, no
matter which framework/cgi script you're using.

AddDefaultCharset in apache is bad because it appends that header for
every resource which didn't specify it.

--
Best regards, Alex




Re: Encoding problem

Posted by Jean-Christophe Boggio <em...@thefreecat.org>.
Hi Gerald,

Gerald Richter - ECOS wrote:
> setting the default encoding in the httpd.conf to utf8 might help

I already have :
   AddDefaultCharset UTF-8
in my httpd.conf.

I tried adding it to my <Directory ...> section, and also
   AddCharset utf-8 .html
with no more luck.

I found other people describing this kind of symptom, one using CGIs :
http://mail-archives.apache.org/mod_mbox/perl-modperl/200806.mbox/%3C485EA6FF.7090104@ice-sa.com%3E

Another with Mason :
http://www.cybaea.net/Blogs/TechNotes/Mason-utf-8-clean.html#h2_form_input

Both use decode() functions (which works for me too), but I guess they are
converting the encoding back and forth... I don't even know if this is related.

Are we supposed to get UTF-8-flagged $fdat{xx} variables when the input contains
accented characters and the form/page are UTF-8?

Thanks for your help,

PS: I rewrote the "fix" in a more "monky" way :

Encode::_utf8_on($fdat{$_}) for keys %fdat;

-- 
Jean-Christophe Boggio                       -o)
embperl@thefreecat.org                       /\\
Independant Consultant and Developer        _\_V



RE: Encoding problem

Posted by Gerald Richter - ECOS <ge...@ecos.de>.
Hi,

Setting the default encoding in httpd.conf to UTF-8 might help.


Gerald

