Posted to dev@tomcat.apache.org by bu...@apache.org on 2008/07/16 12:22:52 UTC

DO NOT REPLY [Bug 45406] New: Decoding URI encoded in UTF-16 does not work correctly.

https://issues.apache.org/bugzilla/show_bug.cgi?id=45406

           Summary: Decoding URI encoded in UTF-16 does not work correctly.
           Product: Tomcat 6
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Catalina
        AssignedTo: tomcat-dev@jakarta.apache.org
        ReportedBy: ran.rubinstein@gmail.com
                CC: ran.rubinstein@gmail.com


The request URL decoder component is not doing its job correctly for UTF-16.

URL encoding rules state that URLs are written in ASCII, while non-ASCII
characters and special characters are encoded using % encoding, with the
bytes taken from a specified charset.

The problem is that the UDecoder and ByteChunk classes used to decode request
URLs in Catalina do not work according to these rules:

UDecoder converts all the % encodings to bytes, and then ByteChunk.toString
converts the whole buffer to a string according to the specified encoding.

So, a UTF-16-encoded URL that looks like this:

http://a.com?utf16parameter=%D%7

which is decoded properly by Java's URLDecoder.decode() method, is not
decoded properly by Catalina, since the characters
'http://a.com?utf16parameter=' are not UTF-16 characters. UDecoder converts
%D%7 to bytes, but then ByteChunk chokes when trying to decode the whole
string as UTF-16, while %D%7 are the only bytes it should have attempted to
decode.
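
The difference between the two strategies can be reproduced outside Tomcat.
The following is a minimal sketch, not Tomcat's actual code: decodeWholeBuffer
is a hypothetical helper that only imitates the behaviour described above
(percent-decode to bytes, then convert the entire buffer with the charset),
and the escape %D7%05 is the corrected value given in comment #2.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class WholeBufferDecodeDemo {

    // Hypothetical helper (not Tomcat code): turn each %XX escape into one
    // byte, copy every other character as a single ASCII byte, then decode
    // the WHOLE buffer as UTF-16LE.
    static String decodeWholeBuffer(String url) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '%' && i + 2 < url.length()) {
                buf.write(Integer.parseInt(url.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                buf.write(c); // low 8 bits only; the input is assumed ASCII
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_16LE);
    }

    // Decode only the escaped bytes with the charset, as URLDecoder does.
    static String decodeEscapesOnly(String url) {
        try {
            return URLDecoder.decode(url, "UTF-16LE");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-16LE is a standard charset
        }
    }

    public static void main(String[] args) {
        String url = "http://a.com?utf16parameter=%D7%05";
        // The ASCII prefix gets paired up into unrelated 16-bit code units.
        System.out.println(decodeWholeBuffer(url).startsWith("http"));
        // Only %D7%05 is interpreted as UTF-16LE (U+05D7, het).
        System.out.println(
            decodeEscapesOnly(url).equals("http://a.com?utf16parameter=\u05D7"));
    }
}
```

In the first call every pair of ASCII bytes is read as one UTF-16LE code unit,
so the readable prefix is lost even though the escaped bytes themselves would
decode correctly.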

Don't tell me to use UTF-8 in my pages so that the browsers will use UTF-8 to
encode, I'm in a situation where this is impossible.


-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #4 from Ran Rubinstein <ra...@gmail.com>  2008-07-16 06:57:15 PST ---
(In reply to comment #3)
> Marking Invalid.
> 
> You aren't able to use utf-16 or ucs-2.  Period.
> 
> The RFC2616 protocol clearly declares the input stream to be an ASCII superset
> stream of otherwise opaque octets.  You can use any representation which is a 
> superset of ASCII and work out what character set you expect, such as UTF-8 or 
> ISO-8859-{any}.
> 
> The %xx syntax clearly defines one byte, and cannot express half a wchar.  If
> you wish to interpret the bytestream in this way, you will have to recombine
> them, but this would be ill advised, as "your protocol" can't necessarily be
> proxied at all.
> 

I accept the statement about RFC2616.
The sad fact is that I have 20,000+ Nokia phones deployed with a buggy browser
that encodes the request parameters in UTF-16LE whatever the page encoding.
Even if Nokia releases a fix, there's no way all of them will update their
firmwares.

My idea was to work around this by using:

request.setCharacterEncoding("UTF-16LE");

in a filter, after detecting the user-agent, and set
useBodyEncodingForURI="true". This doesn't work obviously because of the
situation described above with UDecoder.

My only solution now is to parse the parameters myself and use
URLDecoder.decode(param,"UTF-16LE"), which works correctly. This requires me to
use a HttpServletRequest subclass and override its getParameter* methods.
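
The parsing half of that workaround could look roughly like the sketch below.
It is only an illustration under stated assumptions: Utf16QueryParser and
parse are hypothetical names, the raw query string is assumed to come from
something like getQueryString(), and the resulting map would still have to be
exposed through a request wrapper, which is left out here to keep the example
free of servlet dependencies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class Utf16QueryParser {

    // Hypothetical helper: split a raw (still percent-encoded) query string
    // on '&' and '=', then percent-decode each name and value with the given
    // charset. For repeated names only the last value is kept.
    static Map<String, String> parse(String rawQuery, String charset) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        try {
            for (String pair : rawQuery.split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                String name = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(URLDecoder.decode(name, charset),
                           URLDecoder.decode(value, charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse("utf16parameter=%D7%05&lang=he", "UTF-16LE");
        System.out.println(p.get("utf16parameter").equals("\u05D7"));
        System.out.println(p.get("lang"));
    }
}
```

Splitting on the literal '&' and '=' before decoding is safe here because any
such character inside a value would itself arrive percent-encoded.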

Thanks for the quick reply. If this is a performance issue, then there's really
no reason to degrade Tomcat's general performance for a rare bug in a phone
browser. Still, I've seen URLs encoded with non-ASCII encodings before.

Ran.




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #8 from Julian Reschke <ju...@gmx.de>  2008-07-16 08:12:02 PST ---
(In reply to comment #6)
> If we are to accept UTF-16, let's examine the bytestream of a GET request;
> 
> GET
> \0h\0t\0t\0p\0:\0/\0/\0a\0.\0c\0o\0m\0?\0u\0t\0f\0001\0006\0p\0a\0r\0a\0m\0e\0t\0e\0r\0=\0\0%\0D\00007\0%\0000\0005
> HTTP/1.1
> 
> *That* is UTF-16 encoding.
> ...

Yes. That's not allowed. I didn't say that.

However, from RFC2616's point of view it's totally legal to encode non-ASCII
characters in the URL any way you want. There simply is no requirement that it
needs to be a superset of ASCII.

Of course, whether or not that is a good idea is another question.

So yes, if a server needs to support these kinds of URIs, it needs to
work around the limitations of the servlet engine (another way would be to use
getRequestURI(), and parse that directly).




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #7 from Ran Rubinstein <ra...@gmail.com>  2008-07-16 08:09:21 PST ---
(In reply to comment #6)

I was under the impression that only URL parameters that are part of the query
string have to be encoded, and that when decoding, only %-encoded parts of the
url are affected by the charset.
This is how the Java URLDecoder works:

String s = "http://a.com?utf16param=%D7%05";
s = URLDecoder.decode(s,"UTF-16LE");
System.out.println(s);

out:
http://a.com?utf16param=ח





DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406


Julian Reschke <ju...@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |julian.reschke@gmx.de






DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #2 from Ran Rubinstein <ra...@gmail.com>  2008-07-16 06:20:16 PST ---
(In reply to comment #1)
> If you are literally correct below in your observation, the request should
> return 400 invalid, because %D%7 is not valid.  2 hex digits are mandatory.
> 

You are correct. In the text, replace %D%7 with %D7%05. The correct URL is
http://a.com?utf16parameter=%D7%05

%D7%05 is the Hebrew letter 'het' encoded in UTF-16LE.
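
The byte values can be checked with a few lines of Java. This is a small
sketch for verification only; HetEncodingDemo, hex and percentEncode are
hypothetical names introduced for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class HetEncodingDemo {

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Percent-encode a string using the bytes of the named charset.
    static String percentEncode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String het = "\u05D7"; // HEBREW LETTER HET
        System.out.println(hex(het.getBytes(StandardCharsets.UTF_16LE))); // D7 05
        System.out.println(hex(het.getBytes(StandardCharsets.UTF_8)));    // D7 97
        System.out.println(percentEncode(het, "UTF-16LE"));               // %D7%05
    }
}
```

The UTF-8 form of the same letter is a different two-byte sequence, which is
why the server has to know which charset the client used before it can
interpret the escaped bytes.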




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #1 from Will Rowe <wr...@apache.org>  2008-07-16 06:14:40 PST ---
If you are literally correct below in your observation, the request should
return 400 invalid, because %D%7 is not valid.  2 hex digits are mandatory.




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #9 from Ran Rubinstein <ra...@gmail.com>  2008-07-17 01:17:48 PST ---
(In reply to comment #8)

> Of course, whether or not that is a good idea is another question.
> 
> So yes, if a server needs to support these kinds of URIs, it needs to
> work around the limitations of the servlet engine (another way would be to use
> getRequestURI(), and parse that directly).
> 

Using getRequestURI() and parsing myself is the simple part. Since the
parameter map in the HttpServletRequest is immutable, I need to use my own
subclass for the request and use it throughout my application (and it can't be
forwarded to external web applications). That solution is not very elegant.

I'm not in a position to judge whether encoding URLs with non-ASCII-superset
encodings is legal or not, but:

If non-ASCII-superset-encoded URIs are legal, I think Tomcat should support
them. Tomcat has the setCharacterEncoding() API and the
useBodyEncodingForURI setting, neither of which limits the encoding to an
ASCII superset. It looks like a simple matter of changing how URIs are
decoded, to be compliant with the %-encoding rules.

If these URIs are illegal, perhaps Tomcat should throw some kind of exception
when setCharacterEncoding is called with an 'illegal' encoding and
useBodyEncodingForURI is true.
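
Rejecting such encodings presupposes some test for "ASCII superset". One
possible heuristic is sketched below; isAsciiSuperset is a hypothetical helper
and not part of any Tomcat or servlet API. It assumes that a charset qualifies
when every US-ASCII character encodes to exactly one byte with the same value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiSupersetCheck {

    // Heuristic (an assumption of this sketch): the charset is treated as an
    // ASCII superset if each of the 128 US-ASCII characters encodes to a
    // single byte equal to its code point.
    static boolean isAsciiSuperset(Charset cs) {
        if (!cs.canEncode()) {
            return false;
        }
        for (char c = 0; c < 128; c++) {
            byte[] encoded = String.valueOf(c).getBytes(cs);
            if (encoded.length != 1 || encoded[0] != (byte) c) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAsciiSuperset(StandardCharsets.UTF_8));      // true
        System.out.println(isAsciiSuperset(StandardCharsets.ISO_8859_1)); // true
        System.out.println(isAsciiSuperset(StandardCharsets.UTF_16LE));   // false
    }
}
```

A check along these lines would let a filter or container refuse UTF-16LE up
front instead of silently producing mangled parameters.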




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #10 from Will Rowe <wr...@apache.org>  2008-07-17 08:11:57 PST ---
We might have lost sight of the issue here; this issue is not whether or not
it's possible to encode utf-16-le, it's that the URI was not encoded in that
character set.

As I pointed out above, in the similar -be encoding, we need 16 bits to
transmit each character. This particular browser sent 8-bit octets. That is
not UTF-16.

Two escaped characters do not constitute a UTF-16 request; they are a UTF-16
fragment within an ASCII/ISO-8859/UTF-8/whatever bytestream.  I would recommend
no change whatsoever in Tomcat's URI parsing code on this issue, although you
do raise an interesting observation w.r.t. useBodyEncodingForURI.

As far as working around it, it might be nice if one could deploy a Valve that
was triggered based on User-Agent, that would probably be the most elegant hack
for you to work around this browser error.  Certainly not for core tomcat.




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #5 from Julian Reschke <ju...@gmx.de>  2008-07-16 07:32:26 PST ---
(In reply to comment #3)
> Marking Invalid.
> 
> You aren't able to use utf-16 or ucs-2.  Period.
> 
> The RFC2616 protocol clearly declares the input stream to be an ASCII superset
> stream of otherwise opaque octets.  You can use any representation which is a 
> superset of ASCII and work out what character set you expect, such as UTF-8 or 
> ISO-8859-{any}.
> ...

Does it? I don't think so. Details, please.

Julian (wearing my hat as editor of RFC2616bis)




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #6 from Will Rowe <wr...@apache.org>  2008-07-16 07:58:51 PST ---
If we are to accept UTF-16, let's examine the bytestream of a GET request;

GET
\0h\0t\0t\0p\0:\0/\0/\0a\0.\0c\0o\0m\0?\0u\0t\0f\0001\0006\0p\0a\0r\0a\0m\0e\0t\0e\0r\0=\0\0%\0D\00007\0%\0000\0005
HTTP/1.1

*That* is UTF-16 encoding.

It's clearly defined as OCTETs, and the 'GET' sp and sp "HTTP/n.n" are defined
in ASCII.

I'm having a hard time coming to another conclusion, perhaps you can offer one?

Especially relevant are the definitions;

       OCTET          = <any 8-bit sequence of data>
       CHAR           = <any US-ASCII character (octets 0 - 127)>
       CR             = <US-ASCII CR, carriage return (13)>
       LF             = <US-ASCII LF, linefeed (10)>
       SP             = <US-ASCII SP, space (32)>

Further, consider the statement;

   Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

which is nonsensical unless the ASCII definitions of reserved and unsafe
are taken literally.
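
The contrast between the two byte streams can be made visible with a short
sketch. It is illustrative only: RequestLineBytesDemo and dump are
hypothetical names, and the request target "/?q=%D7%05" is a shortened
stand-in for the URL discussed in this bug.

```java
import java.nio.charset.StandardCharsets;

public class RequestLineBytesDemo {

    // Print printable ASCII bytes as characters and all others as \xNN.
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String target = "/?q=%D7%05";
        // What the browser in this report sends: one octet per character.
        System.out.println(dump(target.getBytes(StandardCharsets.US_ASCII)));
        // A request target genuinely transmitted as UTF-16BE: two octets per
        // character, each ASCII character preceded by a zero byte.
        System.out.println(dump(target.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The first line of output is the target unchanged; the second shows the zero
byte in front of every character, which is the stream shape the comment above
describes.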




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406





--- Comment #11 from Ran Rubinstein <ra...@gmail.com>  2008-07-21 03:41:16 PST ---
(In reply to comment #10)

Will, I'm sorry to drag this on, but I want to understand fully where I'm wrong
in this.

AFAIK, an ascii URL with one character represented in %-encoding such as
http://www.google.com/q=%D7%05 does represent a legal UTF-16 encoded URL.

UTF-16 %-encoding does not mean the client sends two bytes or a wchar for each
letter in the URL, but rather that it sends the URL in ASCII, except for the
parts of the query string that are not ASCII. Those are encoded using
%-encoding, with the bytes determined by the selected encoding (usually UTF-8).
This is also the behavior of Java's built-in URLEncoder.encode() and
URLDecoder.decode() functions.

So a UTF-16 encoded URL, can look like this:

http://www.google.com/q=%D7%05

and be legal.

Is my concept completely off-base?

If this is true, I see no reason for Tomcat not to support this, except of
course that the architecture right now does not support it, since the
%-decoding and string-building classes are separate: ByteChunk expects, well,
a chunk of bytes, which it translates to a string according to the given
encoding, and UDecoder translates the URL to this chunk of bytes.
I suggest that instead of this, when processing URLs/URIs, Tomcat use a
combined approach that is compatible with the %-encoding rule that only
non-ASCII characters are %-encoded.
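
One way to picture such a combined approach is the sketch below. It is not a
proposed patch and does not use Tomcat's classes: EscapeRunDecoder is a
hypothetical name, and the design (copy literal characters through, decode
each run of consecutive %XX escapes as one byte sequence in the chosen
charset) is an assumption modelled on what java.net.URLDecoder does.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EscapeRunDecoder {

    // Copy unescaped characters as-is; collect each run of consecutive %XX
    // escapes into a byte buffer and decode only that run with the charset.
    static String decode(String s, Charset cs) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) != '%') {
                out.append(s.charAt(i));
                i++;
                continue;
            }
            ByteArrayOutputStream run = new ByteArrayOutputStream();
            while (i < s.length() && s.charAt(i) == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("Incomplete escape at index " + i);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                run.write((hi << 4) | lo);
                i += 3;
            }
            out.append(new String(run.toByteArray(), cs));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("http://a.com?utf16parameter=%D7%05",
                                StandardCharsets.UTF_16LE);
        System.out.println(decoded.equals("http://a.com?utf16parameter=\u05D7"));
    }
}
```

With this structure the malformed %D%7 from the original report is rejected
with an exception, in line with the two-hex-digit rule raised in comment #1.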




DO NOT REPLY [Bug 45406] Decoding URI encoded in UTF-16 does not work correctly.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=45406


Will Rowe <wr...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID




--- Comment #3 from Will Rowe <wr...@apache.org>  2008-07-16 06:30:55 PST ---
Marking Invalid.

You aren't able to use utf-16 or ucs-2.  Period.

The RFC2616 protocol clearly declares the input stream to be an ASCII superset
stream of otherwise opaque octets.  You can use any representation which is a 
superset of ASCII and work out what character set you expect, such as UTF-8 or 
ISO-8859-{any}.

The %xx syntax clearly defines one byte, and cannot express half a wchar.  If
you wish to interpret the bytestream in this way, you will have to recombine
them, but this would be ill advised, as "your protocol" can't necessarily be
proxied at all.

