Posted to modperl@perl.apache.org by Eli Shemer <ap...@netvision.net.il> on 2008/03/19 13:06:59 UTC

utf8 urls

Hey there

 

For some reason the following test doesn't print anything to the screen.

Do I need to change something in the Apache configuration, or in mod_perl's?

 

/articles_read.pl?id=חוזרת

 

## get http parameters

$r = shift;

$apr = Apache2::Request->new($r);

print  $apr->param('id');

 

 

thanks in advance.

 


 

Re: utf8 urls

Posted by André Warnier <aw...@ice-sa.com>.
From a previous message by Adam Prime on this same list:
[...]
SetHandler modperl doesn't bind 'print' to '$r->print'.  Try SetHandler 
perl-script, or change your code to pass in the request object and use 
$r->print instead of print.
[...]

or, more verbosely and explicitly:
if, in your Apache configuration for this "location", you used

SetHandler modperl

then you should not assume that print() sends its output to the 
browser.  But since you already did

$r = shift; # get the Apache2::RequestRec object

$r->print() does send its output back to the browser as the response.
You should probably at least set a content-type header though,
like

$r->content_type('text/plain');
$r->print($apr->param('id'));

and, in your case, it might also be a good idea to send back a header 
indicating which character set is used (presumably UTF-8), since the 
default HTTP character set is iso-8859-1, and the string you send back 
doesn't look printable in that charset.

But I don't know exactly how to do that best in mod_perl.
Would the following work ?
$r->content_type('text/plain; charset="UTF-8"');
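
For what it's worth, here is a minimal sketch of the charset idea, assuming the handler already holds a decoded character string. The names send_utf8_text and FakeRequest are made up for illustration; only content_type and print come from this thread, and the stand-in class exists just so the snippet runs outside Apache. The charset is written unquoted, which is the form Torsten's test handler uses.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: send a character string back as UTF-8 text.
# $r is assumed to behave like the mod_perl request object, i.e. to
# offer the content_type() and print() methods used in this thread.
sub send_utf8_text {
    my ($r, $text) = @_;
    $r->content_type('text/plain; charset=UTF-8');
    $r->print(encode_utf8($text));    # characters -> UTF-8 octets
    return;
}

# Stand-in for the request object, only so the sketch runs outside Apache.
package FakeRequest {
    sub new { return bless { type => undef, body => '' }, shift }
    sub content_type {
        my ($self, $type) = @_;
        $self->{type} = $type if defined $type;
        return $self->{type};
    }
    sub print {
        my ($self, @chunks) = @_;
        $self->{body} .= join '', @chunks;
        return;
    }
}

my $fake = FakeRequest->new;
send_utf8_text($fake, "\x{5D7}\x{5D5}\x{5D6}\x{5E8}\x{5EA}");
```

Under mod_perl the real $r would take the place of the stand-in object. Encoding explicitly before printing keeps the octets on the wire in line with the charset the header announces.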

Also, the previous message talking about how to handle your (apparently) 
UTF-8 request should be taken into account.


André



Re: utf8 urls

Posted by John ORourke <jo...@o-rourke.org>.
Geoffrey Young wrote:
> John ORourke wrote:
>> Eli Shemer wrote:
>>>
>>> For some reason the following test doesn’t print anything out to the 
>>> screen
>>>
>> I'm not sure why you get nothing, but I can tell you strings read 
>> from Apache objects come through as octets and need to be decoded 
>> before use. We're using UTF-8 chars in URLs but I've never used one 
>> in a GET request parameter.
>
> I can't say why it doesn't work, but I'm surprised it would in either 
> case - the only characters explicitly allowed in a uri are us-ascii. 
> from rfc2396:
>

My bad memory there - you are quite correct. The way we do it is the 
accepted way - to URL-encode the UTF-8 encoded text, and that will work 
with URLs and parameters.

eg:

http://www....../categories/name/ty%C3%B6kalut-lamput

is the correct form of:

http://www....../categories/name/työkalut-lamput


encode before printing:

use Encode qw(encode_utf8); # core module

$octets = encode_utf8($my_utf8_string); # make octets
# URL-encode every octet outside printable ASCII
$octets =~ s/([^\041-\176])/sprintf("%%%02X",ord($1))/ge;
$r->print($octets);
(the above is simplified - you'll also need to encode question marks etc)

decode after reading:

use Encode qw(decode_utf8); # core module

$url = decode_utf8( $r->uri() );
or
$param = decode_utf8( $r->param('info') );
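
A self-contained sketch of the round trip described above, using only the core Encode module. The helper names uri_escape_utf8_sketch and uri_unescape_utf8_sketch are hypothetical. The choice to leave only letters, digits and "-._~" unescaped is an assumption that suits a single path segment or parameter value, not a whole URL.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: character string -> percent-encoded ASCII.
sub uri_escape_utf8_sketch {
    my ($text) = @_;
    my $octets = encode_utf8($text);    # characters -> UTF-8 octets
    # Escape every octet except letters, digits and - . _ ~
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

# Hypothetical helper: percent-encoded ASCII -> character string.
sub uri_unescape_utf8_sketch {
    my ($escaped) = @_;
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode_utf8($octets);        # UTF-8 octets -> characters
}

my $escaped = uri_escape_utf8_sketch("ty\x{F6}kalut-lamput");
print "$escaped\n";    # ty%C3%B6kalut-lamput

my $round_trip = uri_unescape_utf8_sketch($escaped);
print "round trip ok\n" if $round_trip eq "ty\x{F6}kalut-lamput";
```

On the reading side, whether the value you get is still percent-encoded depends on which layer hands it to you, so the unescape step may already have been done for you.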

cheers
John



Re: utf8 urls

Posted by Geoffrey Young <ge...@modperlcookbook.org>.

John ORourke wrote:
> Eli Shemer wrote:
>>
>> For some reason the following test doesn’t print anything out to the 
>> screen
>>
>> Do I need to change something in the apache configuration, or 
>> mod_perl’s ?
>>
>>  
>>
>> /articles_read.pl?id=חוזרת
>>
>>  
>>
>> ## get http parameters
>>
>> $r = shift;
>>
>> $apr = Apache2::Request->new($r);
>>
>> print  $apr->param('id');
>>
> 
> I'm not sure why you get nothing, but I can tell you strings read from 
> Apache objects come through as octets and need to be decoded before 
> use.  We're using UTF-8 chars in URLs but I've never used one in a GET 
> request parameter.

I can't say why it doesn't work, but I'm surprised it would in either 
case - the only characters explicitly allowed in a uri are us-ascii. 
from rfc2396:

   2.4. Escape Sequences

    Data must be escaped if it does not have a representation using an
    unreserved character; this includes data that does not correspond to
    a printable character of the US-ASCII coded character set, or that
    corresponds to any US-ASCII character that is disallowed, as
    explained below.

A bit of googling turned up this cpan module:

   http://search.cpan.org/dist/URI-Find-UTF8/lib/URI/Find/UTF8.pm

where the docs point to a ja.wikipedia.org page.  for me (firefox 2.0) 
clicking on the "original" uri (the one with the japanese characters) 
opens up a uri with the uri-escaped character sequence.  it's like magic ;)

anyway, my point wasn't to get into some huge debate on whether people 
are (successfully) using utf-8 characters in uris, etc.  rather, it is 
that mod_perl is (mostly) merely a wrapper around apache, and if 
something is improper wrt an official rfc apache generally dismisses it 
rather than bending to a behavior which people may be using anyway.

so, if it works, great.  if not, try making your urls conform to 2396 
and see if you have better results.

--Geoff

Re: utf8 urls

Posted by John ORourke <jo...@o-rourke.org>.
Eli Shemer wrote:
>
> For some reason the following test doesn’t print anything out to the 
> screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
>  
>
> /articles_read.pl?id=חוזרת
>
>  
>
> ## get http parameters
>
> $r = shift;
>
> $apr = Apache2::Request->new($r);
>
> print  $apr->param('id');
>

I'm not sure why you get nothing, but I can tell you strings read from 
Apache objects come through as octets and need to be decoded before 
use.  We're using UTF-8 chars in URLs but I've never used one in a GET 
request parameter.

hope that helps,
John




Re: utf8 urls

Posted by André Warnier <aw...@ice-sa.com>.
I think that these things can get very confused and confusing very 
quickly, unless one steps through them one step at a time.
Let me try a first iteration :

1) URIs, as sent to the HTTP server, should contain only US-ASCII 
characters (and no spaces).  Any other characters should be 
percent-encoded, as the URI RFCs require.
2) Whether Firefox is smart enough to automatically encode a URI 
that contains non-US-ASCII characters is a nice feature of Firefox 
if it does so, but it should not confuse the main issue.
In other words, if you send a non-ASCII URI to a server yourself (via 
curl or lwp-request, e.g.), then you must URI-encode the request 
yourself.
3) According to a previous response, at the receiving side, when Apache 
gets a properly-encoded request URI containing non-ASCII characters, it 
leaves it encoded and passes it "as is" (or "as bytes") to the 
processing layer, which in this case is mod_perl.
4) mod_perl parses the URI and makes it accessible in several ways to 
the modules running under it (in this case a request handler or a script).
Question : does mod_perl decode the URI string prior to  passing it in 
bits and pieces to the handler/script, or not ?
(From another response, it would seem that it doesn't)
5) the handler/script obtains the URI parts from mod_perl, possibly 
through the RequestRec or Request object.
If such URI parts contained non-ASCII characters, do these modules 
perform any translation, or does the handler/script still receive them 
as URI-encoded ?
(From another response, it would seem that they don't, and it does)
6) Now the handler/script has the value of the (for instance) query 
parameter "id" (and assume it contains non-ASCII characters), and it 
wants to output it back to the browser.
To do that, it must arrange to send the browser an HTTP header that 
tells the browser in which character set this response is encoded, 
since by default the HTTP protocol assumes iso-8859-1.
And it seems that in order to do that, it should use, at a minimum,

$param = $apr->param('id');
$r->content_type('text/plain; charset="UTF-8"');
$r->print($param);

There are a couple of aspects not mentioned above, such as
- how does the handler/script "know" which decoding it should apply to 
the URI elements ? Is it certain that it is UTF-8 ?
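
On that last question, one possible heuristic, offered as a sketch and not as anything mod_perl or libapreq2 is known to do for you: try a strict UTF-8 decode first and fall back to ISO-8859-1 if the octets are not valid UTF-8. The helper name decode_param_sketch and the choice of ISO-8859-1 as the fallback are assumptions. The input is assumed to be percent-decoded octets already.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the charset of already percent-decoded octets.
sub decode_param_sketch {
    my ($octets) = @_;
    # Strict decode: dies on malformed UTF-8; LEAVE_SRC keeps $octets intact.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;
    # Not valid UTF-8, so treat each octet as one ISO-8859-1 character.
    return decode('ISO-8859-1', $octets);
}

my $hebrew = decode_param_sketch("\xD7\x97\xD7\x95");  # valid UTF-8
my $latin1 = decode_param_sketch("\xF6");              # lone 0xF6 is not
printf "%d %d\n", length($hebrew), length($latin1);    # 2 1
```

This can only be a guess: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so a site that controls its own links is better off fixing one encoding and sticking to it.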


Another go, anyone ?

André






Re: utf8 urls

Posted by Torsten Foertsch <to...@gmx.net>.
On Wed 19 Mar 2008, Eli Shemer wrote:
> For some reason the following test doesn’t print anything out to the screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
>  
>
> /articles_read.pl?id=חוזרת

This is probably a bug in libapreq2. I have tried this handler:

sub {
  my $r=$_[0];
  $r->content_type('text/html; charset=UTF-8');
  my $x=Apache2::Request->new($r);
  $r->print("<html><body>\nargs=".$r->args."\nparam(x)=".      
            $x->param('x')."\n</body></html>\n");
  return Apache2::Const::OK;
}

http://localhost/test?x=חוזרת entered in FF changes on the fly into
http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works.

But on the command line with curl it doesn't:

$ curl 'http://localhost/test?x=חוזרת' -v
* About to connect() to localhost port 80 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test?x=חוזרת HTTP/1.1
> User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3 libidn/1.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 19 Mar 2008 12:45:29 GMT
< Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5 mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
<html><body>
args=x=חוזרת
param(x)=
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0
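
Since curl sends the URL as given, the caller has to do the escaping that Firefox does on the fly. A minimal sketch of that step, assuming the URL is held as a Perl character string: the helper name escape_non_ascii is hypothetical, and it deliberately touches only non-ASCII characters, leaving the ASCII structure of the URL alone.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: percent-encode only the non-ASCII characters of a
# URL, leaving its ASCII structure (scheme, '/', '?', '=', '&') untouched.
sub escape_non_ascii {
    my ($url) = @_;
    my $octets = encode_utf8($url);    # characters -> UTF-8 octets
    $octets =~ s/([\x80-\xFF])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

# The Hebrew parameter value from this thread, written as code points.
my $url = "http://localhost/test?x=\x{5D7}\x{5D5}\x{5D6}\x{5E8}\x{5EA}";
print escape_non_ascii($url), "\n";
# http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA
```

The result is the same escaped URL that worked when entered through Firefox above, so it can be handed to curl as is.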

Torsten