Posted to users@tomcat.apache.org by Shankar Unni <sh...@netscape.net> on 2007/06/06 00:27:58 UTC

Tomcat URL decoding different from Java's URLDecoder?

We were playing around with a little JSP application, and trying to 
submit (and handle) Big5 characters.

(The real purpose was to exercise our primary app which sniffs HTTP 
traffic and "does stuff" with the raw data it captures - it sees the 
headers and body as sent over the wire.)

One odd thing we noticed was that when we sent in a single (two-byte) 
Big5 character in a form field (the page was already set to character 
encoding Big5), the encoded value sent in the URL was rather screwy:

The original character is (Big5) 0xAE 0x78.

The URL sent by IE said "%AEx". (!!)

Now Java's URLDecoder.decode(s, "Big5") doesn't like the above encoding: 
it insists that both characters be encoded, or neither - it ends up 
decoding this into "?x".

However, Tomcat actually returns the correct single Chinese character 
from "request.getParameter("userid")".

Is it doing the URL decoding in a different way than Java's built-in URL 
decoder?
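
One possible explanation, offered as a sketch and not as a description of
Tomcat's actual code: 0x78 is ASCII 'x', which form encoding leaves
unescaped, so "%AEx" is a legitimate encoding of the byte pair 0xAE 0x78.
URLDecoder converts each run of consecutive %XX escapes to characters on
its own, so the lone byte 0xAE goes through the Big5 decoder before the
literal 'x' is appended. A decoder that rebuilds the whole byte sequence
first and applies the charset afterwards recovers the character. The class
and method names below are hypothetical, and the example assumes the JRE
ships the Big5 charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

class PercentDecodeDemo {

    // Hypothetical helper: undo the %XX escapes at the byte level first,
    // then decode the complete byte sequence with the given charset.
    static String decodeBytewise(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits
            } else if (c == '+') {
                out.write(' '); // form encoding writes a space as '+'
            } else {
                out.write(c); // a literal ASCII character stands for its own byte
            }
        }
        return new String(out.toByteArray(), cs);
    }

    // Thin wrapper around the JDK decoder that hides the checked exception.
    static String decodeWithUrlDecoder(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String viaJdk = decodeWithUrlDecoder("%AEx", "Big5");
        String bytewise = decodeBytewise("%AEx", Charset.forName("Big5"));
        // URLDecoder yields two chars: a replacement character, then 'x'.
        System.out.println("URLDecoder length: " + viaJdk.length());
        // The bytewise decoder yields the single Big5 character.
        System.out.println("bytewise length:   " + bytewise.length());
    }
}
```

The bytewise helper assumes that every unescaped character in the input is
plain ASCII, which holds for data a browser has already form-encoded.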


---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Tomcat URL decoding different from Java's URLDecoder?

Posted by Rashmi Rubdi <ra...@gmail.com>.
Shankar, Chris,

With reference to your previous posts....

> (From section 3.7.1 of the HTTP/1.1 spec).

I tried a small example without forcing the request's content type,
and was able to see the Big5 characters without any problems, even
when the request's characterEncoding was null. The code below demonstrates this.


>So Tomcat is running with -Dfile.charset=Big5 (so that's the java
>default charset), and just does a 'request.getParameter("foo")'. And the
>value for foo in the body of the POST is "%AEx", which (again) we didn't
>encode - it's IE that's doing this when you type in a Big5 char into a
>text field in a form (I guess it's internally treating the raw bytes as
>ISO8859-1).

"%AEx" , could be a URL Encoded Hex representation of the specific
Big5 character (I'm not sure, though, because decoding it gives garbage
characters instead of a Big5 character).

Here's the example I tried, which shows the Big5 character correctly
when transmitted over HTTP GET. Notice the characters in the
URL-encoded parameter.

-----------------------------------------
index.jsp
-----------------------------------------

<%@ page pageEncoding="Big5" contentType="text/html; charset=Big5" %>

<html>

<head>

    <meta http-equiv="Content-Type" content="text/html; charset=Big5" />

</head>

<body>

<%
//String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
//request.setCharacterEncoding(paramEncoding);
%>

Request Character Encoding =
<%=request.getCharacterEncoding()%>



<br/><br/>

Response Character Encoding =
<%=response.getCharacterEncoding()%>


<br/><br/>

<form action="test.jsp" method="GET">
    <input type="text" name="textField" value="造字"/>
    <input type="submit" name="submit" value="submit"/>
</form>

</body>

</html>


------------------------------------------
test.jsp
------------------------------------------

<%@ page import="java.net.URLDecoder" %>
<%@ page import="java.net.URLEncoder" %>
<%@ page pageEncoding="Big5" contentType="text/html; charset=Big5" %>

<html>

<head>

    <meta http-equiv="Content-Type" content="text/html; charset=Big5" />

</head>

<body>

<%
//String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
//request.setCharacterEncoding(paramEncoding);
%>

Request Character Encoding =
<%=request.getCharacterEncoding()%>

<br/><br/>

Response Character Encoding =
<%=response.getCharacterEncoding()%>


<br/><br/>
Text Field Contents =
<%=request.getParameter("textField")%>

<br/><br/>

URL Encode:

<%=URLEncoder.encode("造字", "Big5")%>

<br/><br/>

URL Decode:
<%=URLDecoder.decode("%B3y%A6r", "Big5")%>

The above shows garbage characters.

</body>

</html>
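
A note on the "URL Decode" line in test.jsp, as a sketch under the
assumption that the JRE ships the Big5 charset: URLEncoder escapes every
byte of a character that needs encoding, so its output for these two
characters is four %XX escapes, and URLDecoder reads that form back
correctly. The mixed form "%B3y%A6r" leaves the trail bytes 0x79 and 0x72
as the literals 'y' and 'r', and URLDecoder then decodes each lead byte in
isolation. The class name below is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FullyEscapedDemo {

    // URLEncoder with Big5: every byte of a non-ASCII character is escaped.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "Big5");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // URLDecoder with Big5: each run of consecutive %XX escapes is
    // converted to characters separately from the literals around it.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "Big5");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String full = decode("%B3%79%A6%72"); // both bytes of each character escaped
        String mixed = decode("%B3y%A6r");    // the form used in test.jsp
        System.out.println("fully escaped -> " + full.length() + " chars");
        System.out.println("mixed         -> " + mixed.length() + " chars");
        System.out.println("re-encoded    -> " + encode(full));
    }
}
```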

---------------------------
web.xml
---------------------------

    <context-param>
        <param-name>PARAMETER_ENCODING</param-name>
        <param-value>Big5</param-value>
    </context-param>

---------------------------
Tomcat 6.x server.xml
----------------------------

    <Connector port="9090" protocol="HTTP/1.1"
               maxThreads="150" connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="Big5"/>

-Rashmi



Re: Tomcat URL decoding different from Java's URLDecoder?

Posted by Shankar Unni <sh...@netscape.net>.
Christopher Schultz wrote:

> Shouldn't you use the content-type of the request instead of just
> forcing your own content-type? If the browser does not send a MIME type
> with the request, then the default is defined to be ISO-8859-1:

Well, for a form post, that just means that the text of the body is 
ISO-8859-1 (which it is - it's all %blah-encoded strings). It's how to 
actually decode those strings (using a URL decoder) that matters.

In our case, do notice that I'm *not* calling URLDecoder in Tomcat 
itself. Tomcat is just running a (deliberately poorly written) webapp, 
and we're sniffing that traffic and doing all the decoding in an 
external system.

So Tomcat is running with -Dfile.charset=Big5 (so that's the Java 
default charset), and just does a 'request.getParameter("foo")'. And the 
value for foo in the body of the POST is "%AEx", which (again) we didn't 
encode - it's IE that's doing this when you type in a Big5 char into a 
text field in a form (I guess it's internally treating the raw bytes as 
ISO-8859-1).

Yet, in the JSP, when I do a request.getParameter("foo"), I magically 
end up with the correct Big5 character in a String. That seemed like magic.

Thanks for the hints about PARAMETER_ENCODING, etc. in web.xml (though I 
don't see how that can influence IE to encode things correctly in the 
simple form POST).




Re: Tomcat URL decoding different from Java's URLDecoder?

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Rashmi,

Rashmi Rubdi wrote:
> On 6/5/07, Shankar Unni <sh...@netscape.net> wrote:
>> (the page was already set to character
>> encoding Big5), the encoded value sent in the URL was rather screwy:
>>
>> The original character is (Big5) 0xAE 0x78.
>>
>> The URL sent by IE said "%AEx". (!!)
> 
> Did you also configure the web.xml properly ?
> 
> It should have
> 
> <context-param>
> <param-name>PARAMETER_ENCODING</param-name>
> <param-value>Big5</param-value>
> </context-param>
> 
> Are you setting the request parameter correctly before reading the
> parameter?
> 
> String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
> request.setCharacterEncoding(paramEncoding);

Shouldn't you use the content-type of the request instead of just
forcing your own content-type? If the browser does not send a MIME type
with the request, then the default is defined to be ISO-8859-1:

"The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems."

(From section 3.7.1 of the HTTP/1.1 spec).
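
The defaulting rule quoted above could be sketched like this. The helper
name is hypothetical and the parsing is deliberately simple (it splits on
';' and looks for a charset parameter), so treat it as an illustration of
the rule and not as how any particular container reads the header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {

    // Hypothetical helper: return the charset named in a Content-Type
    // header value, or ISO-8859-1 when none is given or it is unusable.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // The value may be a quoted string; strip the quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        if (!name.isEmpty() && Charset.isSupported(name)) {
                            return Charset.forName(name);
                        }
                    } catch (IllegalArgumentException e) {
                        // Illegal charset name: fall through to the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        System.out.println(charsetOf("text/html; charset=Big5"));
    }
}
```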

- -chris

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGZqeX9CaO5/Lv0PARAhw2AJ4vRSJG0640jpwwVIrJBlKPx+lSogCgnIj9
GwvHipusYE3VorH1vRFpA18=
=LtK5
-----END PGP SIGNATURE-----



Re: Tomcat URL decoding different from Java's URLDecoder?

Posted by Rashmi Rubdi <ra...@gmail.com>.
On 6/5/07, Shankar Unni <sh...@netscape.net> wrote:
> (the page was already set to character
> encoding Big5), the encoded value sent in the URL was rather screwy:
>
> The original character is (Big5) 0xAE 0x78.
>
> The URL sent by IE said "%AEx". (!!)

Did you also configure the web.xml properly?

It should have

<context-param>
<param-name>PARAMETER_ENCODING</param-name>
<param-value>Big5</param-value>
</context-param>

Are you setting the request's character encoding before reading the parameter?

String paramEncoding = application.getInitParameter("PARAMETER_ENCODING");
request.setCharacterEncoding(paramEncoding);



> Now Java's URLDecoder.decode(s, "Big5") doesn't like the above encoding:
> it insists that both characters be encoded, or neither - it ends up
> decoding this into "?x".
>
> However, Tomcat actually returns the correct single chinese character
> from "request.getParameter("userid")".
>
> Is it doing the URL decoding in a different way than Java's built-in URL
> decoder?
>

-Rashmi



Re: Tomcat URL decoding different from Java's URLDecoder?

Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Shankar,

Shankar Unni wrote:
> One odd thing we noticed was that when we sent in a single (two-byte)
> Big5 character in a form field (the page was already set to character
> encoding Big5), the encoded value sent in the URL was rather screwy:

This is probably because URLs should be decoded using UTF-8 or
ISO-8859-1 instead of "Big5". Even if the content-type of the request or
response body is Big5, the URL ought to be UTF-8 or ISO-8859-1.

> However, Tomcat actually returns the correct single chinese character
> from "request.getParameter("userid")".
> 
> Is it doing the URL decoding in a different way than Java's built-in URL
> decoder?

Yes, it's using either UTF-8 or ISO-8859-1 to decode the URL parameters
(you can set this in your Connector's configuration) instead of "Big5",
which you were manually choosing when you used Java's URLDecoder.
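
If the parameters are decoded as ISO-8859-1, the original bytes are not
lost, because ISO-8859-1 maps every byte value to exactly one character.
The sketch below (hypothetical class and method names, and assuming the
JRE ships the Big5 charset) decodes "%AEx" as ISO-8859-1 and then
reinterprets the resulting bytes as Big5.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Recover {

    // Percent-decode as ISO-8859-1 so that each byte becomes one char.
    static String decodeAsLatin1(String s) {
        try {
            return URLDecoder.decode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Turn the chars back into the original bytes, then decode those
    // bytes with the charset the browser actually used.
    static String recode(String decodedAsLatin1, String actualCharset) {
        byte[] raw = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        String latin1 = decodeAsLatin1("%AEx"); // two chars: U+00AE, then 'x'
        String big5 = recode(latin1, "Big5");   // one Big5 character
        System.out.println(latin1.length() + " -> " + big5.length());
    }
}
```

This only works when the first decoding step really was ISO-8859-1. A
string that has already been decoded with a lossy charset cannot be
repaired this way.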

Hope that helps.

- -chris

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGZe6L9CaO5/Lv0PARAv/7AJwNeZyp39F5N/zYpIUyRK/tyL8HggCgrzU8
hPaSFVlAop0gngQLs2dDsJA=
=3oPF
-----END PGP SIGNATURE-----
