Posted to users@httpd.apache.org by LiuYan 刘研 <lo...@21cn.com> on 2010/12/01 16:31:25 UTC

[users@httpd] mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

I recently set up Apache 2.2.17 on Windows Server 2003 and configured viewvc in
CGI mode. It works fine except when browsing repository entries that contain
Chinese characters: those requests return HTTP 404. I asked on the viewvc-users
mailing list, and they said that a CGI program interacts with the system using
the locale of the environment in which it runs
(http://viewvc.tigris.org/ds/viewMessage.do?dsForumId=4255&dsMessageId=2686631).


I tried a small shell CGI script like the following:
################################################################################
#!C:\cygwin\bin\bash.exe
# test.sh 
# Environment Variable 
echo Content-type: text/html 
echo 
echo "<html>" 
echo "<head>" 
echo "<title>" 
echo "CGI Environment Variable" 
echo "</title>" 
echo "</head>" 
echo "<body>" 
echo "SERVER_SOFTWARE=$SERVER_SOFTWARE<br/>" 
echo "SERVER_NAME=$SERVER_NAME<br/>" 
echo "SERVER_PROTOCOL=$SERVER_PROTOCOL<br/>" 
echo "SERVER_PORT=$SERVER_PORT<br/>" 
echo "REQUEST_METHOD=$REQUEST_METHOD<br/>" 
echo "GATEWAY_INTERFACE=$GATEWAY_INTERFACE<br/>" 
echo "PATH_INFO=$PATH_INFO<br/>" 
echo "PATH_TRANSLATED=$PATH_TRANSLATED<br/>" 
echo "REMOTE_HOST=$REMOTE_HOST<br/>" 
echo "REMOTE_ADDR=$REMOTE_ADDR<br/>" 
echo "REMOTE_IDENT=$REMOTE_IDENT<br/>" 
echo "SCRIPT_NAME=$SCRIPT_NAME<br/>" 
echo "QUERY_STRING=$QUERY_STRING<br/>" 
echo "CONTENT_TYPE=$CONTENT_TYPE<br/>" 
echo "CONTENT_LENGTH=$CONTENT_LENGTH<br/>" 

echo "<pre>"
/bin/env
echo "</pre>"

echo "</body>" 
echo "</html>" 

exit 0
################################################################################

I then tried two URLs in different encodings, UTF-8 and GBK.

"中文" as a UTF-8 encoded URL:
http://localhost/cgi-bin/cgi-test.sh/%E4%B8%AD%E6%96%87

"中文" as a GBK encoded URL:
http://localhost/cgi-bin/cgi-test.sh/%D6%D0%CE%C4
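
A quick way to confirm that both URLs carry the same two characters is to
percent-decode each one under its own encoding. This is only a minimal sketch,
assuming Python 3 and its standard library; the variable and function names are
made up for illustration and are not part of the CGI setup described here.

```python
from urllib.parse import unquote, unquote_to_bytes

utf8_path = "%E4%B8%AD%E6%96%87"  # path suffix of the UTF-8 URL
gbk_path = "%D6%D0%CE%C4"         # path suffix of the GBK URL

# Each form decodes to the same text when read with its own encoding.
decoded_utf8 = unquote(utf8_path, encoding="utf-8")
decoded_gbk = unquote(gbk_path, encoding="gbk")
print(decoded_utf8 == decoded_gbk == "中文")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes form a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The GBK bytes are not valid UTF-8, so software that assumes UTF-8
# cannot decode the second URL as such.
print(is_valid_utf8(unquote_to_bytes(utf8_path)))
print(is_valid_utf8(unquote_to_bytes(gbk_path)))
```

The practical point is that the server only sees bytes; which characters they
stand for depends entirely on the encoding the receiving side assumes.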


The byte values of the Chinese characters in the resulting HTML are not correct.

UTF-8:
src :   E4    B8    AD    E6    96    87
dest:C3 A4 C2 B8 C2 AD C3 A6 C2 96 C2 87 

GBK:
src :    D6    D0    CE    C4
dest:C3  96 C3 90 C3 8E C3 84 
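
The dest bytes above match what one gets when every source byte is read as a
single Latin-1 (ISO-8859-1) character and the result is then encoded as UTF-8.
The following minimal Python 3 sketch reproduces that pattern. It only
illustrates the observed bytes; it is an assumption, not a confirmed account of
what httpd or Cygwin do internally, and the helper names are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def reencode_as_latin1(raw: bytes) -> bytes:
    """Read each byte as a Latin-1 code point, then encode the text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


utf8_src = unquote_to_bytes("%E4%B8%AD%E6%96%87")
gbk_src = unquote_to_bytes("%D6%D0%CE%C4")

print(hexdump(reencode_as_latin1(utf8_src)))  # C3 A4 C2 B8 C2 AD C3 A6 C2 96 C2 87
print(hexdump(reencode_as_latin1(gbk_src)))   # C3 96 C3 90 C3 8E C3 84
```

Both printed lines equal the dest rows, which suggests that somewhere between
the request and the script output the bytes are widened one by one instead of
being decoded as multibyte sequences.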


I also tried adding "SetEnv LC_ALL zh_CN.GBK" or "SetEnv LC_ALL zh_CN.UTF-8" to
httpd.conf, as suggested on the viewvc-users mailing list, and even tried adding
a Windows system environment variable LC_ALL, but neither helped. Am I missing
something, or does mod_cgi not support multibyte characters in REQUEST_URI?

Thanks!


---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
   "   from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org


[users@httpd] Re: mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

Posted by LiuYan 刘研 <lo...@21cn.com>.
J. Greenlees <lists <at> jaqui-greenlees.net> writes:

> Just a thought here,
> why not try removing environment complexity and use the windows versions
> of apache's httpd and perl instead of the cygwin based install?
> this makes 1 less set of environment variables that could be causing
> locale issues

Thank you for your response, Greenlees!

My httpd installation is the official win32 release
(httpd-2.2.17-win32-x86-no_ssl.msi). I have also tried ActivePerl
(ActivePerl-5.12.2.1202-MSWin32-x86-293621.msi) with the official printenv.pl:

http://localhost/cgi-bin/printenv.pl/%D6%D0%CE%C4_%E4%B8%AD%E6%96%87

The result is PATH_INFO="/?D??_??-???". As you can see, most characters are
turned into question marks, so the original bytes are lost in the encoding
conversion.

I have searched and tested a lot these days (Perl, Python, bash scripts, and
even a C program I wrote), but still have no solution.

Linux users may not encounter this issue. I really don't know why Apache
mod_cgi on Windows can't handle it well. How does it work? I mean, WHEN is the
charset conversion done: when Apache hands the request to mod_cgi, in mod_cgi
itself, or when the CGI program reads its environment?






Re: [users@httpd] mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

Posted by "William A. Rowe Jr." <wr...@rowe-clan.net>.
On 12/1/2010 5:09 PM, J. Greenlees wrote:
> Just a thought here,
> why not try removing environment complexity and use the windows versions
> of apache's httpd and perl instead of the cygwin based install?

You would want to avoid the cygwin-based install, first because you introduce
all of the security holes present in httpd 1.3 (more, in fact), and secondly
because you have layered three abstraction layers on top of each other when
only one is required (clib, cygdll and apr, rather than simply apr, which speaks
natively to the win32 API, or even natively to windows sfu).




Re: [users@httpd] mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

Posted by "J. Greenlees" <li...@jaqui-greenlees.net>.
LiuYan 刘研 wrote:
> I recently set up Apache 2.2.17 on Windows Server 2003 and configured viewvc
> in CGI mode. It works fine except when browsing repository entries that
> contain Chinese characters: those requests return HTTP 404. I asked on the
> viewvc-users mailing list, and they said that a CGI program interacts with
> the system using the locale of the environment in which it runs
> (http://viewvc.tigris.org/ds/viewMessage.do?dsForumId=4255&dsMessageId=2686631).
Just a thought here,
why not try removing environment complexity and use the windows versions
of apache's httpd and perl instead of the cygwin based install?
this makes 1 less set of environment variables that could be causing
locale issues



Re: [users@httpd] Re: mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

Posted by "William A. Rowe Jr." <wr...@rowe-clan.net>.
On 12/16/2010 4:06 AM, LiuYan 刘研 wrote:
> William A. Rowe Jr. <wrowe <at> rowe-clan.net> writes:
>
>> On 12/1/2010 9:31 AM, LiuYan 刘研 wrote:
>>> I recently set up Apache 2.2.17 on Windows Server 2003 and configured
>>> viewvc in CGI mode. It works fine except when browsing repository entries
>>> that contain Chinese characters: those requests return HTTP 404. I asked
>>> on the viewvc-users mailing list, and they said that a CGI program
>>> interacts with the system using the locale of the environment in which it
>>> runs
>>> (http://viewvc.tigris.org/ds/viewMessage.do?dsForumId=4255&dsMessageId=2686631).
>>
>> If you set up viewvc's CGI host to run under the utf-8 code page, things
>> should work correctly.  On win32, all file names are unicode, and httpd and
>> dav then represent these as utf-8.
>
> Thank you William!
>
> I don't know how to set the default Windows code page to UTF-8; there is no
> UTF-8 option in ControlPanel--Locale/Language--Advanced. I tried changing the
> code page to 65001 (UTF-8) in a DOS prompt window and running httpd.exe from
> that window, but I got the same result.

Numerically you are right.  Just to understand what httpd does, it has passed all
of the environment table and CGI variables as Unicode.  That will be translated
by windows cmd.exe environment into whatever code page you are running (and you
should choose the code page to include all of your possible responses).  When
you prepare results which offer links, you might explicitly need to translate
them to utf-8.

If you run a unicode-aware language, there is no translation at all, or if there
is translation, it occurs based on the unicode program input from the environment.
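
For the step of preparing links, one way to translate a path to UTF-8
explicitly is to percent-encode it from its UTF-8 bytes. This is a minimal
sketch, assuming a Python 3 CGI program that already holds the path as a
Unicode string; the function name link_for is hypothetical.

```python
from urllib.parse import quote


def link_for(script_name: str, path_info: str) -> str:
    """Build a link whose path is percent-encoded from UTF-8 bytes."""
    # quote() keeps "/" and unreserved characters, and escapes everything
    # else as %XX using the requested encoding.
    return quote(script_name, encoding="utf-8") + quote(path_info, encoding="utf-8")


print(link_for("/cgi-bin/cgi-test.sh", "/中文"))
# /cgi-bin/cgi-test.sh/%E4%B8%AD%E6%96%87
```

Encoding at the moment the link is written keeps the result independent of
whatever code page the process happens to run under.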

> part of that answer:
> ---------
> ...
> However most byte-based tools using the C stdio (and I'm assuming this applies 
> to ColdFusion, as it does under Perl, Python 2, PHP etc.) then try to read the 
> environment variables as bytes, and the MS C runtime encodes the Unicode 
> contents again using the Windows default code page. So any characters that 
> don't fit in the default code page are lost for good. This would include your 
> Arabic characters when running on a Western Windows install.


Exactly. Any time you pass through the command environment this happens, unless
the program entry points are the unicode-aware flavors.
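
The loss described in the quoted answer can be imitated by pushing a Unicode
string through a byte code page and back. This is a minimal Python 3 sketch
under stated assumptions: Python's "replace" error handler stands in for the
Windows conversion, which can additionally apply best-fit substitutions, so the
exact output on a real system may differ. The function name is hypothetical.

```python
def through_code_page(text: str, code_page: str) -> str:
    """Encode text to a byte code page, replacing unmappable characters
    with '?', then decode the bytes back to text."""
    return text.encode(code_page, errors="replace").decode(code_page)


# A Western code page has no slot for the Chinese characters.
print(through_code_page("/中文", "cp1252"))  # /??

# The Simplified Chinese code page (GBK) round-trips them intact.
print(through_code_page("/中文", "cp936") == "/中文")  # True
```

Once a character has become "?", no later step can recover it, which matches
the "lost for good" wording in the answer.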




[users@httpd] Re: mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

Posted by LiuYan 刘研 <lo...@21cn.com>.
William A. Rowe Jr. <wrowe <at> rowe-clan.net> writes:

> 
> On 12/1/2010 9:31 AM, LiuYan 刘研 wrote:
> > I recently set up Apache 2.2.17 on Windows Server 2003 and configured
> > viewvc in CGI mode. It works fine except when browsing repository entries
> > that contain Chinese characters: those requests return HTTP 404. I asked
> > on the viewvc-users mailing list, and they said that a CGI program
> > interacts with the system using the locale of the environment in which it
> > runs
> > (http://viewvc.tigris.org/ds/viewMessage.do?dsForumId=4255&dsMessageId=2686631).
>
> If you set up viewvc's CGI host to run under the utf-8 code page, things
> should work correctly.  On win32, all file names are unicode, and httpd and
> dav then represent these as utf-8.
>

Thank you William!

I don't know how to set the default Windows code page to UTF-8; there is no
UTF-8 option in ControlPanel--Locale/Language--Advanced. I tried changing the
code page to 65001 (UTF-8) in a DOS prompt window and running httpd.exe from
that window, but I got the same result.
----------
cd Apache\bin
chcp 65001
httpd
----------


There's a similar question on stackoverflow.com:
http://stackoverflow.com/questions/2764446/problem-using-unicode-in-urls-with-cgi-path-info-in-coldfusion

Although the poster uses ColdFusion on IIS, he encountered exactly the same
problem. One answer identifies a cause, but I'm not sure whether it is right.

Part of that answer:
---------
...
However most byte-based tools using the C stdio (and I'm assuming this applies 
to ColdFusion, as it does under Perl, Python 2, PHP etc.) then try to read the 
environment variables as bytes, and the MS C runtime encodes the Unicode 
contents again using the Windows default code page. So any characters that 
don't fit in the default code page are lost for good. This would include your 
Arabic characters when running on a Western Windows install.
...
---------





Re: [users@httpd] mod_cgi: multibyte characters in REQUEST_URI can't converted to correct PATH_INFO

Posted by "William A. Rowe Jr." <wr...@rowe-clan.net>.
On 12/1/2010 9:31 AM, LiuYan 刘研 wrote:
> I recently set up Apache 2.2.17 on Windows Server 2003 and configured viewvc
> in CGI mode. It works fine except when browsing repository entries that
> contain Chinese characters: those requests return HTTP 404. I asked on the
> viewvc-users mailing list, and they said that a CGI program interacts with
> the system using the locale of the environment in which it runs
> (http://viewvc.tigris.org/ds/viewMessage.do?dsForumId=4255&dsMessageId=2686631).

If you set up viewvc's CGI host to run under the utf-8 code page, things should
work correctly.  On win32, all file names are unicode, and httpd and dav then
represent these as utf-8.
