Posted to modperl@perl.apache.org by Eli Shemer <ap...@netvision.net.il> on 2008/03/19 13:06:59 UTC
utf8 urls
Hey there
For some reason the following test doesn’t print anything to the screen.
Do I need to change something in the Apache configuration, or in mod_perl’s?
/articles_read.pl?id=חוזרת
## get http parameters
$r = shift;
$apr = Apache2::Request->new($r);
print $apr->param('id');
thanks in advance.
Re: utf8 urls
Posted by André Warnier <aw...@ice-sa.com>.
From a previous message by Adam Prime in this same list :
[...]
SetHandler modperl doesn't bind 'print' to '$r->print'. Try SetHandler
perl-script, or change your code to pass in the request object and use
$r->print instead of print.
[...]
or, more verbosely and explicitly :
if, in your Apache configuration for this "location", you used
SetHandler modperl
then you should not assume that print() sends its output to the
browser. But since you did
$r = shift; # get the Apache2::RequestRec object
you can call $r->print(), which does go back as a response to the browser.
You should probably at least set a content-type header though,
like
$r->content_type('text/plain');
$r->print($apr->param('id'));
and, in your case, it might also be a good idea to send back a header
indicating which character set is used (presumably UTF-8), since the
default HTTP character set is iso-8859-1, and the string you send back
does not look printable in that charset.
But I don't know exactly how to do that best in mod_perl.
Would the following work ?
$r->content_type('text/plain; charset="UTF-8"');
Also, the previous message talking about how to handle your (apparently)
UTF-8 request should be taken into account.
André
Re: utf8 urls
Posted by John ORourke <jo...@o-rourke.org>.
Geoffrey Young wrote:
> John ORourke wrote:
>> Eli Shemer wrote:
>>>
>>> For some reason the following test doesn’t print anything out to the
>>> screen
>>>
>> I'm not sure why you get nothing, but I can tell you strings read
>> from Apache objects come through as octets and need to be decoded
>> before use. We're using UTF-8 chars in URLs but I've never used one
>> in a GET request parameter.
>
> I can't say why it doesn't work, but I'm surprised it would in either
> case - the only characters explicitly allowed in a uri are us-ascii.
> from rfc2396:
>
My bad memory there - you are quite correct. The way we do it is the
accepted way - to URL-encode the UTF-8 encoded text, and that will work
with URLs and parameters.
eg:
http://www....../categories/name/ty%C3%B6kalut-lamput
is the correct form of:
http://www....../categories/name/työkalut-lamput
encode before printing:
use Encode qw(encode decode);
$octets = encode('UTF-8', $my_utf8_string); # make octets
$octets =~ s/([^\041-\176])/sprintf("%%%02X",ord($1))/ge; # URL-encode non-ASCII octets
$r->print($octets);
(the above is simplified - you'll also need to encode question marks etc)
decode after reading:
$url = decode('UTF-8', $r->uri());
or
$param = decode('UTF-8', $apr->param('info'));
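
The encode/decode round trip described above can be tried outside Apache with core Perl only. This is a minimal sketch, not anyone's production code: the function names are made up for illustration, and it escapes everything outside the RFC 3986 "unreserved" set, which is a stricter choice than the RFC 2396 rules discussed in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: characters -> UTF-8 octets -> percent-escapes.
# Everything outside A-Z a-z 0-9 - . _ ~ is escaped (assumed policy).
sub uri_escape_utf8_sketch {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

# Hypothetical helper: percent-escapes -> octets -> characters.
sub uri_unescape_utf8_sketch {
    my ($escaped) = @_;
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $octets);
}

my $name    = "ty\x{f6}kalut-lamput";        # the "työkalut-lamput" example
my $escaped = uri_escape_utf8_sketch($name);
print "$escaped\n";                          # ty%C3%B6kalut-lamput
print "round trip ok\n" if uri_unescape_utf8_sketch($escaped) eq $name;
```

If the URI distribution from CPAN is installed, URI::Escape's uri_escape_utf8 and uri_unescape cover the same ground and are probably the better choice in real code.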
cheers
John
Re: utf8 urls
Posted by Geoffrey Young <ge...@modperlcookbook.org>.
John ORourke wrote:
> Eli Shemer wrote:
>>
>> For some reason the following test doesn’t print anything out to the
>> screen
>>
>> Do I need to change something in the apache configuration, or
>> mod_perl’s ?
>>
>>
>>
>> /articles_read.pl?id=חוזרת
>>
>>
>>
>> ## get http parameters
>>
>> $r = shift;
>>
>> $apr = Apache2::Request->new($r);
>>
>> print $apr->param('id');
>>
>
> I'm not sure why you get nothing, but I can tell you strings read from
> Apache objects come through as octets and need to be decoded before
> use. We're using UTF-8 chars in URLs but I've never used one in a GET
> request parameter.
I can't say why it doesn't work, but I'm surprised it would in either
case - the only characters explicitly allowed in a uri are us-ascii.
from rfc2396:
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.
A bit of googling turned up this cpan module:
http://search.cpan.org/dist/URI-Find-UTF8/lib/URI/Find/UTF8.pm
where the docs point to a ja.wikipedia.org page. for me (firefox 2.0)
clicking on the "original" uri (the one with the japanese characters)
opens up a uri with the uri-escaped character sequence. it's like magic ;)
anyway, my point wasn't to get into some huge debate on whether people
are (successfully) using utf-8 characters in uris, etc. rather, it is
that mod_perl is (mostly) merely a wrapper around apache, and if
something is improper wrt an official rfc apache generally dismisses it
rather than bending to a behavior which people may be using anyway.
so, if it works, great. if not, try making your urls conform to 2396
and see if you have better results.
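
A quick way to see whether a URL is likely to trip over this is to check it before sending. The sketch below is only a rough pre-flight check under stated assumptions, not a full RFC 2396 validator: the two function names are hypothetical, and "safe" here just means printable US-ASCII with well-formed percent-escapes.

```perl
use strict;
use warnings;

# Hypothetical check: only printable US-ASCII, no spaces, no control
# characters, nothing above 0x7E. Necessary for a valid URI, not sufficient.
sub looks_uri_safe {
    my ($uri) = @_;
    return $uri =~ /\A[\x21-\x7E]*\z/ ? 1 : 0;
}

# Hypothetical check: every '%' must be followed by two hex digits.
sub escapes_well_formed {
    my ($uri) = @_;
    return $uri !~ /%(?![0-9A-Fa-f]{2})/ ? 1 : 0;
}

# One escaped URI and one containing raw Hebrew characters.
for my $uri ('/test?x=%D7%97%D7%95', "/test?x=\x{5d7}\x{5d5}") {
    print looks_uri_safe($uri) ? "ascii only\n" : "needs escaping\n";
}
```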
--Geoff
Re: utf8 urls
Posted by John ORourke <jo...@o-rourke.org>.
Eli Shemer wrote:
>
> For some reason the following test doesn’t print anything out to the
> screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
>
>
> /articles_read.pl?id=חוזרת
>
>
>
> ## get http parameters
>
> $r = shift;
>
> $apr = Apache2::Request->new($r);
>
> print $apr->param('id');
>
I'm not sure why you get nothing, but I can tell you strings read from
Apache objects come through as octets and need to be decoded before
use. We're using UTF-8 chars in URLs but I've never used one in a GET
request parameter.
hope that helps,
John
Re: utf8 urls
Posted by André Warnier <aw...@ice-sa.com>.
I think that these things can get very confused and confusing very
quickly, unless one steps through them one step at a time.
Let me try a first iteration :
1) URIs, as sent to the HTTP server, should contain only US-ASCII
characters (and no spaces). If there are other characters, they should
be encoded using the appropriate RFC-dictated URI-encoding scheme.
2) Whether Firefox is smart enough to automatically encode a URI
properly when it notices that it contains non-US-ASCII characters is a
nice feature of Firefox if it does, but it should not confuse the main issue.
In other words, if you send a non-ASCII URI to a server (via curl or
lwp-request, for example), then you should take care to URI-encode the
request yourself.
3) According to a previous response, at the receiving side, when Apache
gets a properly-encoded request URI containing non-ASCII characters, it
leaves it encoded and passes it "as is" (or "as bytes") to the
processing layer, which in this case is mod_perl.
4) mod_perl parses the URI and makes it accessible in several ways to
the modules running under it (in this case a request handler or a script).
Question : does mod_perl decode the URI string prior to passing it in
bits and pieces to the handler/script, or not ?
(From another response, it would seem that it doesn't)
5) the handler/script obtains the URI parts from mod_perl, possibly
through the RequestRec or Request object.
If such URI parts contained non-ASCII characters, do these modules
perform any translation, or does the handler/script still receive them
as URI-encoded ?
(From another response, it would seem that they don't, and it does)
6) Now the handler/script has the value of the (for instance) query
parameter "id" (and assume it contains non-ASCII characters), and it
wants to output it back to the browser.
To do that, it must arrange to send to the browser a HTTP header that
will tell the browser in which character set this response is encoded,
since by default the HTTP protocol says it is iso-8859-1.
And it seems that in order to do that, it should use, as a minimum
$param = $apr->param('id');
$r->content_type('text/plain; charset="UTF-8"');
$r->print($param);
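
Steps 3 to 6 can be walked through by hand outside Apache. The following is a sketch in core Perl, assuming the query string arrives percent-encoded and that the underlying octets are UTF-8. parse_query_sketch is a hypothetical stand-in for illustration and does not claim to reproduce what libapreq2 does internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical parser: raw query string -> hash of decoded character strings.
sub parse_query_sketch {
    my ($raw) = @_;
    my %param;
    for my $pair (split /[&;]/, $raw) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                                 # '+' means space in form data
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;     # percent-escapes -> octets
            $_ = decode('UTF-8', $_);                # octets -> characters (assumed UTF-8)
        }
        $param{$name} = $value;
    }
    return \%param;
}

my $param = parse_query_sketch('id=%D7%97%D7%95%D7%96%D7%A8%D7%AA');
my $body  = encode('UTF-8', $param->{id});   # characters -> octets for the response
binmode STDOUT;                              # write the octets unchanged
print "Content-Type: text/plain; charset=UTF-8\n\n$body\n";
```

The point of the explicit decode and encode steps is that the script always knows whether it is holding characters or octets, which is where most of the confusion in this thread comes from.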
There are a couple of aspects not mentioned above, such as
- how does the handler/script "know" which decoding it should apply to
the URI elements ? Is it certain that it is UTF-8 ?
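
On that last question, nothing in the request says which charset the escaped octets are in, so a script can only guess. One common heuristic, sketched here as an assumption rather than a recommendation from this thread, is to attempt a strict UTF-8 decode and fall back to another charset if it fails. The function name and the ISO-8859-1 fallback are both illustrative choices.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns (decoded text, charset that was used).
# Strict UTF-8 is tried first; ISO-8859-1 is an assumed fallback.
sub decode_guess {
    my ($octets) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $octets,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 'UTF-8') if defined $text;
    return (Encode::decode('ISO-8859-1', $octets), 'ISO-8859-1');
}

my ($text, $charset) = decode_guess("ty\xC3\xB6kalut");   # valid UTF-8 octets
print "$charset\n";                                        # UTF-8
($text, $charset) = decode_guess("ty\xF6kalut");           # 0xF6 alone is not valid UTF-8
print "$charset\n";                                        # ISO-8859-1
```

The heuristic can misfire: some ISO-8859-1 byte sequences happen to be valid UTF-8, so controlling what the client sends (for example through the charset of the page containing the form) is more reliable than guessing.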
Another go, anyone ?
André
Re: utf8 urls
Posted by Torsten Foertsch <to...@gmx.net>.
On Wed 19 Mar 2008, Eli Shemer wrote:
> For some reason the following test doesn’t print anything out to the screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
>
>
> /articles_read.pl?id=חוזרת
This is probably a bug in libapreq2. I have tried this handler:
sub {
    my $r=$_[0];
    $r->content_type('text/html; charset=UTF-8');
    my $x=Apache2::Request->new($r);
    $r->print("<html><body>\nargs=".$r->args."\nparam(x)=".
              $x->param('x')."\n</body></html>\n");
    return Apache2::Const::OK;
}
http://localhost/test?x=חוזרת entered in FF changes on the fly into
http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works.
But on the command line with curl it doesn't:
$ curl 'http://localhost/test?x=חוזרת' -v
* About to connect() to localhost port 80 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test?x=חוזרת HTTP/1.1
> User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e
zlib/1.2.3 libidn/1.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 19 Mar 2008 12:45:29 GMT
< Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5
mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
<html><body>
args=x=חוזרת
param(x)=
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0
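
In the trace above curl sent the raw characters exactly as typed, while Firefox escaped them first. To make a command-line test send the same request Firefox does, the URL can be escaped before it is handed to curl. A minimal sketch in core Perl follows; build_url_sketch is a hypothetical helper, and it assumes the value should be sent as percent-encoded UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build "base?name=value" with the value
# percent-encoded as UTF-8 octets (everything outside A-Z a-z 0-9 - . _ ~).
sub build_url_sketch {
    my ($base, $name, $value) = @_;
    my $octets = encode('UTF-8', $value);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return "$base?$name=$octets";
}

my $word = "\x{5d7}\x{5d5}\x{5d6}\x{5e8}\x{5ea}";   # the Hebrew word from this thread
print build_url_sketch('http://localhost/test', 'x', $word), "\n";
# http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA
```

Passing the printed URL to curl should then produce the same request line that Firefox generated.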
Torsten