Posted to users@tomcat.apache.org by Mike Wilson <mi...@hotmail.com> on 2013/02/14 15:51:48 UTC

getRequestURI() in relation to Connector.URIEncoding

I can see that even if you specify URIEncoding=UTF-8 in server.xml,
calls to HttpServletRequest.getRequestURI() will still return an
undecoded String. (This is probably because of the "specification
text" in javadoc: "The web container does not decode this String.")

My question is whether this behaviour has changed across Tomcat
versions.

We ran into problems with this when upgrading to Tomcat 7, and it seems
we were getting decoded strings previously when we were using
JBoss 4 (based on Tomcat 5.5, IIRC).

Thanks
Mike Wilson


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: [OT] getRequestURI() in relation to Connector.URIEncoding

Posted by André Warnier <aw...@ice-sa.com>.
Mike Wilson wrote:
...
> 
> Example 2: path /ä in "binary" Unicode
>   GET /.. [0xC3,0xA4]
> 

To nitpick : this is not "binary Unicode". It is simply non-URL-encoded, raw UTF-8, which 
is itself an encoding of Unicode.

The Unicode "codepoint" of "ä" is 0xE4 (decimal 228), usually represented as U+00E4.
That would be the "binary Unicode" value of this character (although one could argue that 
"11100100" would be more proper for binary).
It represents the position of this character in the overall Unicode characters table.

This is encoded as the 2 bytes [0xC3,0xA4] (decimal [195,164]) in the UTF-8 encoding.
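
To make the distinction between code point and encoded bytes concrete, here is a minimal Java sketch. The class and method names (CodePointDemo, hex) are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

final class CodePointDemo {
    // Formats bytes in the "[0xC3,0xA4]" notation used in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(String.format("0x%02X", bytes[i] & 0xFF));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        String s = "\u00E4"; // the character "ä", written as an escape
        // The Unicode code point (position in the character table).
        System.out.println("U+" + String.format("%04X", s.codePointAt(0)));
        // The same character in two different encodings.
        System.out.println("UTF-8:      " + hex(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + hex(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

In ISO-8859-1 the single byte happens to equal the code point, which is part of why the two notions get mixed up so easily.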

Confusion in terminology leads to "mojibake", which in German can be translated as 
"Buchstabensalat" (see http://en.wikipedia.org/wiki/Mojibake).




RE: getRequestURI() in relation to Connector.URIEncoding

Posted by Mike Wilson <mi...@hotmail.com>.
[multiple inline responses]

Rainer Jung wrote:
> I doubt that such URLs are invalid - not based on any code inspection,
> but simply on the fact that mod_jk decoded percent encoding before
> forwarding for a long time (5.5 years, from Oct. 2001 to May 2007,
> version 1.2.0 to 1.2.22). Since version 1.2.24 any bytes in the URI
> expected to be unsafe are percent encoded before forwarding. At least
> that's the default. If you use a non-default ForwardURIxxx option via
> "JkOptions", then that behavior depends on the chosen setting.
> 
> Nevertheless it makes sense to check and clarify.
> 
> Which mod_jk version and JkOptions are you using?

We were indeed running with the "2007" default, ie resulting in
ForwardURICompat which has an appropriate warning in the docs.

But my point is not that a change in Tomcat could hit "us" - we will 
correct our config this week. My point is that invalidating these urls
could break sites for folks that don't follow this mailing list and
just update to the latest Tomcat ;-)


Mark Thomas wrote:
<snip>
> While it is a little surprising that getRequestURI() returns characters
> outside of those defined for uric by RFC2396 given the circumstances I
> think it is reasonable (for AJP) since that is what Tomcat received.
> Arguably a byte that represents a character not in uric should be
> re-encoded using %nn before including it in the return value for
> getRequestURI() but I don't see a need to implement that. If it was
> causing a problem somehow then I could be persuaded otherwise.
> 
> 
> I am more surprised by the HTTP connector. Looking at the code it is
> clear why this happens. The sequence is:
> 
> 1. %nn -> byte
> 2. normalise
> 3. convert to characters
> 
> Bytes that should have been %nn encoded but have not, simply skip the
> first stage and then continue as normal.
> 
> Where this could get messy is when the client converts multibyte
> characters to bytes using one encoding and Tomcat converts those bytes
> to characters using a different encoding. However, while this might
> cause unexpected behaviour from the client's point of view I don't see
> how this could cause a problem for Tomcat. Any sequence of bytes that
> Tomcat ends up processing from stage 2 as a result of byte -> char
> conversion issues onwards could be sent legally using %nn encoding.
> 
> Tomcat could justifiably reject these requests as not conforming to RFC
> 2616. That said, RFC2616 also encourages servers to be tolerant about
> what they receive from clients and I think this falls into that
> category. As long as such behaviour does not cause a problem for Tomcat
> I think it is reasonable to leave the current behaviour as is.
> 
> That leaves the behaviour of getRequestURI(). It is returning what the
> client sent so no issue there. Again given a specific issue I'd be
> prepared to look at %nn encoding for characters not in uric. I agree
> access to the bytes would be ideal but since bytes are only necessary
> when going above and beyond what is required by RFC 2616 it isn't
> surprising that the Servlet EG opted to return a String here.

I think we are talking about four alternatives on how to handle 
this. Here's my 2c about them:

1) Leave as is
I wish getRequestURI() was declared with byte[] return value...
It hurts to see these bytes copied straight into a string. I like
this alternative the least.

2) Invalid, throw an error back at the client
This is strict and clear, might surprise some folks if implemented in
a point release.

3) Decode binary chars in getRequestURI() according to URIEncoding (ie, 
returning a fully decoded value.)
This follows Postel's law. I like Postel's law.

4) Revert binary chars in getRequestURI() back to URL encoding (ie, 
returning a value with % notation.)
This follows Postel's law. I like Postel's law.
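
For what it's worth, alternative 4 could look roughly like the sketch below. This is not Tomcat code: the class name ReencodeSketch and the method reencode are hypothetical, and the allowed set is my reading of "uric" in RFC 2396 (alphanumerics, reserved, mark, plus "%" so that existing escapes pass through untouched).

```java
final class ReencodeSketch {
    // Punctuation allowed by RFC 2396 "uric" (reserved + mark), plus '%'
    // so that escapes already present in the URI are kept as they are.
    private static final String URIC_PUNCT = ";/?:@&=+$,-_.!~*'()%";

    // Returns the URI with every byte outside the allowed set written as %nn.
    static String reencode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF;
            boolean alnum = (v >= 'a' && v <= 'z')
                    || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (alnum || URIC_PUNCT.indexOf(v) >= 0) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }
}
```

With this, both of my examples would give the same getRequestURI() value, "/%C3%A4", and no charset would need to be guessed to build the String.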

Best regards
Mike




Re: getRequestURI() in relation to Connector.URIEncoding

Posted by Mark Thomas <ma...@apache.org>.
On 18/02/2013 11:44, André Warnier wrote:
> Mark Thomas wrote:
>> On 18/02/2013 09:54, Rainer Jung wrote:
>>> On 17.02.2013 23:57, André Warnier wrote:
>>
>>>> Otherwise, my feeling is that it will cost you quite a number of beers
>>>> to stop Mark from fixing what could potentially be a security issue,
>>>> now
>>>> that he's sniffed it.
>>> :)
>>>
>>> Not sure whether Mark's sniffing changes based on the fact that we are
>>> now talking about the AJP part of the connectors.
>>
>> It does mean I'm rather less concerned since that explains why the
>> request wasn't rejected with a 400 response.
> 
> Well, the OP did not specifically test with the HTTP Connector, but it
> doesn't mean that the issue is not there too..

Very true. I did a quick test and the request is not rejected which
surprised me rather.

>> I still want to look at this to understand why getRequestURI() is
>> behaving the way it is. There might still be a bug here.
>>
> 
> Looks like getRequestURI() is behaving according to the Javadocs though,
> by providing the original request line undecoded, "as is".  The issue is
> that the request should probably not even get to the point where it can
> be retrieved by getRequestURI(), no ?

A little digging and it is clear that the AJP connector is behaving as
intended. Any %nn values sent by the client are decoded by the reverse
proxy so the AJP message contains (mostly) the raw bytes. There is the
potential for a problem here (CVE-2007-0450, CVE-2007-1860) but as long
as the reverse proxy is correctly configured (which it is by default -
see [1]) all will be fine as the potentially problematic bytes will be
re-encoded.

While it is a little surprising that getRequestURI() returns characters
outside of those defined for uric by RFC2396 given the circumstances I
think it is reasonable (for AJP) since that is what Tomcat received.
Arguably a byte that represents a character not in uric should be
re-encoded using %nn before including it in the return value for
getRequestURI() but I don't see a need to implement that. If it was
causing a problem somehow then I could be persuaded otherwise.


I am more surprised by the HTTP connector. Looking at the code it is
clear why this happens. The sequence is:

1. %nn -> byte
2. normalise
3. convert to characters

Bytes that should have been %nn encoded but have not, simply skip the
first stage and then continue as normal.
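
As a rough illustration of that sequence (a simplified sketch, not the actual connector code; the class and method names are invented, and stage 2 only collapses repeated slashes):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class UriPipelineSketch {
    // Stage 1: each %nn becomes the byte it denotes; all other bytes,
    // including raw non-ASCII ones, pass through unchanged.
    static byte[] percentDecode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) (in[i + 1] & 0xFF), 16);
                int lo = Character.digit((char) (in[i + 2] & 0xFF), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(in[i]);
        }
        return out.toByteArray();
    }

    // Stage 2 (heavily simplified): collapse "//" into "/".
    static byte[] normalise(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '/' && i > 0 && in[i - 1] == '/') continue;
            out.write(in[i]);
        }
        return out.toByteArray();
    }

    // Stage 3: bytes -> characters using the configured URIEncoding.
    static String decode(byte[] requestUriBytes, Charset uriEncoding) {
        return new String(normalise(percentDecode(requestUriBytes)), uriEncoding);
    }
}
```

Feeding it "/%C3%A4" or the raw bytes '/', 0xC3, 0xA4 with UTF-8 produces the same decoded path, which matches what was observed for getPathInfo().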

Where this could get messy is when the client converts multibyte
characters to bytes using one encoding and Tomcat converts those bytes
to characters using a different encoding. However, while this might
cause unexpected behaviour from the client's point of view I don't see
how this could cause a problem for Tomcat. Any sequence of bytes that
Tomcat ends up processing from stage 2 as a result of byte -> char
conversion issues onwards could be sent legally using %nn encoding.

Tomcat could justifiably reject these requests as not conforming to RFC
2616. That said, RFC2616 also encourages servers to be tolerant about
what they receive from clients and I think this falls into that
category. As long as such behaviour does not cause a problem for Tomcat
I think it is reasonable to leave the current behaviour as is.

That leaves the behaviour of getRequestURI(). It is returning what the
client sent so no issue there. Again given a specific issue I'd be
prepared to look at %nn encoding for characters not in uric. I agree
access to the bytes would be ideal but since bytes are only necessary
when going above and beyond what is required by RFC 2616 it isn't
surprising that the Servlet EG opted to return a String here.


So in summary, not what I was expecting but after some digging I don't
see anything that particularly concerns me at this point.


> The beer question is still open..

Beers are always welcome. I'll be at ApacheConNA in Portland, OR next
week :)

Mark

[1] http://tomcat.apache.org/connectors-doc/reference/apache.html#Forwarding




Re: getRequestURI() in relation to Connector.URIEncoding

Posted by André Warnier <aw...@ice-sa.com>.
Mark Thomas wrote:
> On 18/02/2013 09:54, Rainer Jung wrote:
>> On 17.02.2013 23:57, André Warnier wrote:
> 
>>> Otherwise, my feeling is that it will cost you quite a number of beers
>>> to stop Mark from fixing what could potentially be a security issue, now
>>> that he's sniffed it.
>> :)
>>
>> Not sure whether Mark's sniffing changes based on the fact that we are
>> now talking about the AJP part of the connectors.
> 
> It does mean I'm rather less concerned since that explains why the
> request wasn't rejected with a 400 response.

Well, the OP did not specifically test with the HTTP Connector, but it doesn't mean that 
the issue is not there too..

> 
> I still want to look at this to understand why getRequestURI() is
> behaving the way it is. There might still be a bug here.
>

Looks like getRequestURI() is behaving according to the Javadocs though, by providing the 
original request line undecoded, "as is".  The issue is that the request should probably 
not even get to the point where it can be retrieved by getRequestURI(), no ?


The beer question is still open..



Re: getRequestURI() in relation to Connector.URIEncoding

Posted by Mark Thomas <ma...@apache.org>.
On 18/02/2013 09:54, Rainer Jung wrote:
> On 17.02.2013 23:57, André Warnier wrote:

>> Otherwise, my feeling is that it will cost you quite a number of beers
>> to stop Mark from fixing what could potentially be a security issue, now
>> that he's sniffed it.
> 
> :)
> 
> Not sure whether Mark's sniffing changes based on the fact that we are
> now talking about the AJP part of the connectors.

It does mean I'm rather less concerned since that explains why the
request wasn't rejected with a 400 response.

I still want to look at this to understand why getRequestURI() is
behaving the way it is. There might still be a bug here.

Mark



Re: getRequestURI() in relation to Connector.URIEncoding

Posted by Rainer Jung <ra...@kippdata.de>.
On 17.02.2013 23:57, André Warnier wrote:
> Mike Wilson wrote:
>> Mark Thomas wrote:
>>> On 17/02/2013 16:54, André Warnier wrote:
>>>> Mike Wilson wrote:
>>> <snip/>
>>>
>>>>> Example 2: path /ä in "binary" Unicode
>>>>>   GET /.. [0xC3,0xA4]
>>>>>   request.getRequestURI() -> "/.." [0xC3,0xA4]
>>>>>   request.getPathInfo()   -> "/ä"
>>> <snip/>
>>>
>>>> I believe that your example #2 above is simply illegal.
>>>> One is not supposed to send such bytes in a URL without URL-encoding them.
>>>> That's per the HTTP RFC itself :
>>>> RFC 2616 3.2.2 & 3.2.3
>>>> (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
>>>> -> RFC 2396 part 2. URI Characters and Escape Sequences
>>>> (http://www.ietf.org/rfc/rfc2396.txt)
>>>>
>>>> And I believe that the fact that Tomcat is returning the "correct"
>>>> translation in the corresponding request.getPathInfo() is purely
>>>> accidental, and it could be argued that this is a bug in Tomcat : the
>>>> request should probably have been rejected, because the requested URL
>>>> was invalid.
>>> +1. It is on my list of things to do to check why this wasn't
>>> rejected with a 400 response.
>>>
>>> Mark
>>
>> Explicitly making this invalid is probably fine, although it might
>> be looked upon as "breaking" working systems. Note that we have
>> apparently been running with a setup sending these binary URLs
>> for years, where mod_jk is the source of the invalid URLs.
>> Ie, the browser sends a nice URL-encoded URL which is decoded by
>> mod_jk before sending to Tomcat.
>>
>> So might be appropriate to hold off this change to a release where
>> back compat isn't crucial?
>>
> 
> Mmmm.
> It stretches the imagination a bit to imagine that mod_jk by default
> takes a valid URL and makes it invalid before forwarding it to Tomcat.

The web server will first decode the URL to be able to do whatever it is
configured to do. When mod_jk needs to forward the request, there's a
decision needed:

- using the original undecoded URL: that seems to be safe, but means it
will be incompatible with any URL rewriting configured in Apache, e.g.
using mod_rewrite

- using the final decoded and maybe rewritten URL: this is insecure,
because it can be used for double-encoding attacks.

- using the final decoded and maybe rewritten URL, but re-encoding any
bytes that do not seem to be safe: that's what mod_jk currently does by
default.
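
To illustrate why forwarding the decoded URL is risky, here is a small Java sketch of a double-encoding case. It is only an illustration under assumptions: the class name and the example path are made up, and it uses java.net.URLDecoder as a stand-in for the decoding done by the web server and by the backend.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class DoubleDecodeSketch {
    // One round of percent-decoding, as a proxy or a backend might apply it.
    static String decodeOnce(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String fromClient = "/app/%252e%252e/secret"; // "%25" is an escaped '%'
        String afterProxy = decodeOnce(fromClient);   // what gets forwarded if not re-encoded
        String afterBackend = decodeOnce(afterProxy); // what the backend ends up with
        System.out.println(fromClient + " -> " + afterProxy + " -> " + afterBackend);
    }
}
```

Any check done on the once-decoded form does not see the ".." that appears after the second decoding, which is why re-encoding unsafe bytes before forwarding is the safer default.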

> As far as I recall, there are several options in mod_jk (ForwardURI*
> family) which allow to do things there, some of them unsafe.

Right, see above. The default should be safe.

> So it raises the question : are you doing something until now which is
> considered as unsafe, and therefore are having that problem ?
> (And a linked question is whether by changing this mod_jk option you
> could restore operability with a Tomcat rejecting the invalid URLs).
> 
> Otherwise, my feeling is that it will cost you quite a number of beers
> to stop Mark from fixing what could potentially be a security issue, now
> that he's sniffed it.

:)

Not sure whether Mark's sniffing changes based on the fact that we are
now talking about the AJP part of the connectors.

Regards,

Rainer



Re: getRequestURI() in relation to Connector.URIEncoding

Posted by André Warnier <aw...@ice-sa.com>.
Mike Wilson wrote:
> Mark Thomas wrote:
>> On 17/02/2013 16:54, André Warnier wrote:
>>> Mike Wilson wrote:
>> <snip/>
>>
>>>> Example 2: path /ä in "binary" Unicode
>>>>   GET /.. [0xC3,0xA4]
>>>>   request.getRequestURI() -> "/.." [0xC3,0xA4]
>>>>   request.getPathInfo()   -> "/ä"
>> <snip/>
>>
>>> I believe that your example #2 above is simply illegal.
>>> One is not supposed to send such bytes in a URL without URL-encoding them.
>>> That's per the HTTP RFC itself :
>>> RFC 2616 3.2.2 & 3.2.3
>>> (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
>>> -> RFC 2396 part 2. URI Characters and Escape Sequences
>>> (http://www.ietf.org/rfc/rfc2396.txt)
>>>
>>> And I believe that the fact that Tomcat is returning the "correct"
>>> translation in the corresponding request.getPathInfo() is purely
>>> accidental, and it could be argued that this is a bug in Tomcat : the
>>> request should probably have been rejected, because the requested URL
>>> was invalid.
>> +1. It is on my list of things to do to check why this wasn't rejected
>> with a 400 response.
>>
>> Mark
> 
> Explicitly making this invalid is probably fine, although it might
> be looked upon as "breaking" working systems. Note that we have
> apparently been running with a setup sending these binary URLs
> for years, where mod_jk is the source of the invalid URLs.
> Ie, the browser sends a nice URL-encoded URL which is decoded by 
> mod_jk before sending to Tomcat.
> 
> So might be appropriate to hold off this change to a release where
> back compat isn't crucial?
> 

Mmmm.
It stretches the imagination a bit to imagine that mod_jk by default takes a valid URL and 
makes it invalid before forwarding it to Tomcat.
As far as I recall, there are several options in mod_jk (ForwardURI* family) which allow 
to do things there, some of them unsafe.
So it raises the question : are you doing something until now which is considered as 
unsafe, and therefore are having that problem ?
(And a linked question is whether by changing this mod_jk option you could restore 
operability with a Tomcat rejecting the invalid URLs).

Otherwise, my feeling is that it will cost you quite a number of beers to stop Mark from 
fixing what could potentially be a security issue, now that he's sniffed it.




Re: getRequestURI() in relation to Connector.URIEncoding

Posted by Rainer Jung <ra...@kippdata.de>.
On 17.02.2013 23:00, Mike Wilson wrote:
> Mark Thomas wrote:
>> On 17/02/2013 16:54, André Warnier wrote:
>>> Mike Wilson wrote:
>>
>> <snip/>
>>
>>>> Example 2: path /ä in "binary" Unicode
>>>>   GET /.. [0xC3,0xA4]
>>>>   request.getRequestURI() -> "/.." [0xC3,0xA4]
>>>>   request.getPathInfo()   -> "/ä"
>>
>> <snip/>
>>
>>> I believe that your example #2 above is simply illegal.
>>> One is not supposed to send such bytes in a URL without URL-encoding them.
>>> That's per the HTTP RFC itself :
>>> RFC 2616 3.2.2 & 3.2.3
>>> (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
>>> -> RFC 2396 part 2. URI Characters and Escape Sequences
>>> (http://www.ietf.org/rfc/rfc2396.txt)
>>>
>>> And I believe that the fact that Tomcat is returning the "correct"
>>> translation in the corresponding request.getPathInfo() is purely
>>> accidental, and it could be argued that this is a bug in Tomcat : the
>>> request should probably have been rejected, because the requested URL
>>> was invalid.
>>
>> +1. It is on my list of things to do to check why this wasn't rejected
>> with a 400 response.
>>
>> Mark
> 
> Explicitly making this invalid is probably fine, although it might
> be looked upon as "breaking" working systems. Note that we have
> apparently been running with a setup sending these binary URLs
> for years, where mod_jk is the source of the invalid URLs.
> Ie, the browser sends a nice URL-encoded URL which is decoded by 
> mod_jk before sending to Tomcat.
> 
> So might be appropriate to hold off this change to a release where
> back compat isn't crucial?

Now you throw in another component in the mix. mod_jk is not using HTTP
as a protocol to talk to Tomcat and the protocol decoding is not
identical with the HTTP one. Before saying such binary URLs are invalid
someone would need to check the AJP protocol and the protocol parser in
Tomcat about this.

I doubt that such URLs are invalid - not based on any code inspection,
but simply on the fact that mod_jk decoded percent encoding before
forwarding for a long time (5.5 years, from Oct. 2001 to May 2007,
version 1.2.0 to 1.2.22). Since version 1.2.24 any bytes in the URI
expected to be unsafe are percent encoded before forwarding. At least
that's the default. If you use a non-default ForwardURIxxx option via
"JkOptions", then that behavior depends on the chosen setting.

Nevertheless it makes sense to check and clarify.

Which mod_jk version and JkOptions are you using?

Can you give a real example of the original URI, the URI that mod_jk
forwards (JkLogLevel debug will show it, but that's not meant for
production) and what that forwarded URL should look like instead?

Regards,

Rainer





RE: getRequestURI() in relation to Connector.URIEncoding

Posted by Mike Wilson <mi...@hotmail.com>.
Mark Thomas wrote:
> On 17/02/2013 16:54, André Warnier wrote:
> > Mike Wilson wrote:
> 
> <snip/>
> 
> >> Example 2: path /ä in "binary" Unicode
> >>   GET /.. [0xC3,0xA4]
> >>   request.getRequestURI() -> "/.." [0xC3,0xA4]
> >>   request.getPathInfo()   -> "/ä"
> 
> <snip/>
> 
> > I believe that your example #2 above is simply illegal.
> > One is not supposed to send such bytes in a URL without URL-encoding them.
> > That's per the HTTP RFC itself :
> > RFC 2616 3.2.2 & 3.2.3
> > (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
> > -> RFC 2396 part 2. URI Characters and Escape Sequences
> > (http://www.ietf.org/rfc/rfc2396.txt)
> >
> > And I believe that the fact that Tomcat is returning the "correct"
> > translation in the corresponding request.getPathInfo() is purely
> > accidental, and it could be argued that this is a bug in Tomcat : the
> > request should probably have been rejected, because the requested URL
> > was invalid.
> 
> +1. It is on my list of things to do to check why this wasn't rejected
> with a 400 response.
> 
> Mark

Explicitly making this invalid is probably fine, although it might
be looked upon as "breaking" working systems. Note that we have
apparently been running with a setup sending these binary URLs
for years, where mod_jk is the source of the invalid URLs.
Ie, the browser sends a nice URL-encoded URL which is decoded by 
mod_jk before sending to Tomcat.

So might be appropriate to hold off this change to a release where
back compat isn't crucial?

Best regards
Mike




Re: getRequestURI() in relation to Connector.URIEncoding

Posted by Mark Thomas <ma...@apache.org>.
On 17/02/2013 16:54, André Warnier wrote:
> Mike Wilson wrote:

<snip/>

>> Example 2: path /ä in "binary" Unicode
>>   GET /.. [0xC3,0xA4]
>>   request.getRequestURI() -> "/.." [0xC3,0xA4]
>>   request.getPathInfo()   -> "/ä"

<snip/>

> I believe that your example #2 above is simply illegal.
> One is not supposed to send such bytes in a URL without URL-encoding them.
> That's per the HTTP RFC itself :
> RFC 2616 3.2.2 & 3.2.3
> (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
> -> RFC 2396 part 2. URI Characters and Escape Sequences
> (http://www.ietf.org/rfc/rfc2396.txt)
>
> And I believe that the fact that Tomcat is returning the "correct"
> translation in the corresponding request.getPathInfo() is purely
> accidental, and it could be argued that this is a bug in Tomcat : the
> request should probably have been rejected, because the requested URL
> was invalid.

+1. It is on my list of things to do to check why this wasn't rejected 
with a 400 response.

Mark




Re: getRequestURI() in relation to Connector.URIEncoding

Posted by André Warnier <aw...@ice-sa.com>.
Mike Wilson wrote:
> Hi Chris,
> 
> I'm aware of the two levels of encoding but I'm wondering whether 
> servlet specification writers were :-)
> Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".
> 
> Example 1: path /ä in URL-encoded Unicode as sent from browser
>   GET /%C3%A4
>   request.getRequestURI() -> "/%C3%A4"
>   request.getPathInfo()   -> "/ä"
> 
> Example 2: path /ä in "binary" Unicode
>   GET /.. [0xC3,0xA4]
>   request.getRequestURI() -> "/.." [0xC3,0xA4]
>   request.getPathInfo()   -> "/ä"
> 
> So here we can see that getRequestURI() returns the path completely
> undecoded, ie doesn't apply URL decoding nor character decoding. In
> example 1 this is what I expected, but in example 2 the result is
> that getRequestURI() returns a String containing undecoded binary.
> I would expect a String to have been converted to the appropriate
> character set, otherwise the method should return a byte[].
> 
> Internally Tomcat deals with both these examples as we can see
> getPathInfo() always return the correct decoded path, so I guess 
> this issue is all about how to interpret the servlet specification. 
> 
> The servlet 3.0 pdf doesn't give any details on the getRequestURI() 
> method, so the only clue I can find is the getRequestURI() javadoc 
> text:
>   "The web container does not decode this String."
> but the examples given in javadoc only illustrates the removal of
> query string and don't go into any kind of encoding.
> 
> So the question is if the javadoc "does not decode" text:
> - only applies to URL-encoding (so non-URL-encoded values should
>   go through character set decoding)
> - or, applies also when only character encoding is used (in which 
>   case I think the specification has a bug, as getRequestURI() 
>   then should return byte[])
> ?
> 
> [Naturally, not doing URL-decoding also means that the underlying
> character encoding remains untouched. The "bug" here is when only
> character encoding is present. F ex, this appears in some mod_jk
> configurations.]
> 

Hi.
(being in a  contest with Mark E. here,)
My 2.5 cent, as someone who is not an expert at Java nor Tomcat per se, but who has spent 
an extensive amount of time on the question of dealing with multiple character sets in a 
web context.

I believe that your example #2 above is simply illegal.
One is not supposed to send such bytes in a URL without URL-encoding them.
That's per the HTTP RFC itself :
RFC 2616 3.2.2 & 3.2.3 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.2.2)
-> RFC 2396 part 2. URI Characters and Escape Sequences
(http://www.ietf.org/rfc/rfc2396.txt)

And I believe that the fact that Tomcat is returning the "correct" translation in the 
corresponding request.getPathInfo() is purely accidental, and it could be argued that this 
is a bug in Tomcat : the request should probably have been rejected, because the requested 
URL was invalid.
But it was not rejected, so it filtered further down, and because you did specify that the 
URL-encoding was to be seen as UTF-8, something further down the line converted this 
2-byte UTF-8 sequence in the appropriate internal representation of the character "ä" in 
Java, as seen in your logging of request.getPathInfo().

(See RFC 2616, 5.1.2 Request-URI :
"The Request-URI is transmitted in the format specified in section 3.2.1. If the 
Request-URI is encoded using the "% HEX HEX" encoding [42], the origin server MUST decode 
the Request-URI in order to properly interpret the request. Servers SHOULD respond to 
invalid Request-URIs with an appropriate status code. ")


So if we disregard this invalid URL example #2 (since it is invalid and thus any further 
behaviour could be considered as "undefined"), we are left with the general case #1.

The RFCs 2616 and 2396 do not mandate any specific character set/encoding for the request.
The only thing that they say, is that if the request contains bytes other than the ones 
considered as "reserved" or "safe", they should be "URL-encoded" prior to transmission by 
the client to the server; and that the first thing that the server should do on reception, 
is to "URL-decode" them and restore the original bytes representation, as the client meant 
to send them.

And here is one area where the specs are failing : there is no way, in the HTTP protocol, 
for the client to indicate to the server what the original character set/encoding of the 
URL is; so how can the server know ?

My own interpretation would be as follows :
- in the absence of any other information, the URL after URL-decoding should be viewed as 
being in the ISO-8859-1 encoding, as this is the "default character set/encoding" for HTTP 
(1.1) in general.
- and any other interpretation depends on a prior agreement between client and server.

And the URIEncoding attribute of the Tomcat Connector can be considered as such a prior 
client-server agreement, like : "in all the applications accessed through this Connector, 
the client and the server agree beforehand that any URLs requested by the client will be 
Unicode, UTF-8 encoded".
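
A small Java sketch of what happens when that agreement does not hold, i.e. when the bytes left after URL-decoding are interpreted with a different charset than the one the client used. The class and method names are hypothetical; the only library call is the standard String constructor taking a Charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetAgreementSketch {
    // Interprets the bytes obtained after URL-decoding using the charset
    // the server assumes (what URIEncoding would select in Tomcat).
    static String interpret(byte[] urlDecodedBytes, Charset assumed) {
        return new String(urlDecodedBytes, assumed);
    }

    public static void main(String[] args) {
        byte[] sentAsUtf8 = {(byte) 0xC3, (byte) 0xA4};  // client meant "ä" in UTF-8
        byte[] sentAsLatin1 = {(byte) 0xE4};             // client meant "ä" in ISO-8859-1

        // Agreement holds: the original character comes back.
        System.out.println(interpret(sentAsUtf8, StandardCharsets.UTF_8));
        System.out.println(interpret(sentAsLatin1, StandardCharsets.ISO_8859_1));

        // Agreement broken: two characters instead of one, or a replacement character.
        System.out.println(interpret(sentAsUtf8, StandardCharsets.ISO_8859_1));
        System.out.println(interpret(sentAsLatin1, StandardCharsets.UTF_8));
    }
}
```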

In other words, if your application can guarantee that any request URL sent by one of its 
clients will be UTF-8 encoded, /then/ you can use the URIEncoding="UTF-8" attribute in 
Tomcat.  And only then.
(because e.g. if one of the client users /types/ a URL in the URL bar of his browser, and 
this URL happens to target your Tomcat application, you can never be sure that the URL 
will be UTF-8 encoded when the browser sends it, because that depends on the settings in 
the browser)

The URIEncoding attribute is something which Tomcat adds, outside the HTTP specification 
(and even outside the Servlet Spec, AFAIK), to make life easier for the Tomcat application 
programmers : because Tomcat webapps are written in Java; because the internal character 
set of Java is Unicode; and because it is likely, on a Tomcat host, that all static and 
JSP pages will be saved as UTF-8 encoded, therefore it is easier to allow the programmer 
to just "assume" that when he uses request.getPathInfo() (or similar calls like 
request.getParameter()), he will get a Java string, properly decoded, if the client sent 
it that way (which in the general case it would mostly do).

And then, to get back to the initial question, I would assume that request.getRequestURI() 
is really meant as a "low-level" call, which returns the request URI "as is", before /any/ 
interpretation has taken place (not even the URL-decoding (which should happen first), and 
much less any character set decoding (which should happen later)).
While the other calls (like request.getPathInfo()) are higher-level calls, which return 
strings which have already been URL-decoded and character-set decoded.
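
The same low-level versus high-level split exists in the JDK's own java.net.URI class, which may help as an analogy (it is only an analogy, not how Tomcat implements these calls): getRawPath() returns the path as written, while getPath() returns it with escaped octets decoded. The wrapper class below is hypothetical.

```java
import java.net.URI;

final class RawVersusDecodedSketch {
    // "As is" view, comparable in spirit to request.getRequestURI().
    static String raw(String requestTarget) {
        return URI.create(requestTarget).getRawPath();
    }

    // Decoded view, comparable in spirit to request.getPathInfo().
    static String decoded(String requestTarget) {
        return URI.create(requestTarget).getPath();
    }

    public static void main(String[] args) {
        System.out.println(raw("/%C3%A4"));
        System.out.println(decoded("/%C3%A4"));
    }
}
```

Note that java.net.URI always decodes escaped octets as UTF-8, whereas in Tomcat the charset used for the decoded calls depends on the Connector's URIEncoding.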


And if you want to see the underlying issues in all their glory, I suggest the following 
experiment :
1) in a Linux system's shell window, set your locale to one based on UTF-8. (and make sure 
that your "terminal" is also set that way).
    Then inside one of your webapp's directories, create a file named "ÄÖÜ.txt" (I am 
assuming that you can enter that, considering your examples above), with some text A in 
it.  After creating the file, do an "ls" and a "cat" to see what you got.
2) change your locale and client settings to one based on ISO-8859-1, and create another 
file named "ÄÖÜ.txt", with some different text B content.  Do an "ls" and a "cat" again, 
to see that you really have 2 files with different names and contents.
3) now use a browser (preferably IE for once), and try to request either one of these 
files through Tomcat, by typing your request in the browser's URL bar.
You can play around with the settings of the browser (send URLs as ..), with the 
URIEncoding attribute in the Tomcat Connector, and the "locale" under which Tomcat is started.
To vary a bit, you can also try to put the corresponding links in a couple of html pages, 
with different encodings for the pages.
For even more fun, you can also create a little webapp which will accept the name of the 
desired file as a request parameter, open it and return its content.
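
If you want to see why steps 1 and 2 give two different files, the following small Java 
sketch prints the bytes of the same name under both encodings. The class name 
FileNameBytes and its hex() helper are invented for this illustration. It assumes that the 
file system stores names as plain byte sequences, as is usual on Linux.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class FileNameBytes {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A-umlaut, O-umlaut, U-umlaut followed by ".txt", written with
        // escapes so that this source file stays pure ASCII.
        String name = "\u00C4\u00D6\u00DC.txt";

        // UTF-8: two bytes per umlaut -> C3 84 C3 96 C3 9C 2E 74 78 74
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_8)));

        // ISO-8859-1: one byte per umlaut -> C4 D6 DC 2E 74 78 74
        System.out.println(hex(name.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Two different byte sequences, so two different names on disk, even though both look the 
same on a screen set to the matching encoding.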

It is only to English-speaking Java programmers writing English-speaking applications that 
the matter may appear simple and settled.





RE: getRequestURI() in relation to Connector.URIEncoding

Posted by Mike Wilson <mi...@hotmail.com>.
Hi Chris,

I'm aware of the two levels of encoding but I'm wondering whether 
servlet specification writers were :-)
Here are two examples from Tomcat 7 running with URIEncoding="UTF-8".

Example 1: path /ä in URL-encoded Unicode as sent from browser
  GET /%C3%A4
  request.getRequestURI() -> "/%C3%A4"
  request.getPathInfo()   -> "/ä"

Example 2: path /ä in "binary" Unicode
  GET /.. [0xC3,0xA4]
  request.getRequestURI() -> "/.." [0xC3,0xA4]
  request.getPathInfo()   -> "/ä"

So here we can see that getRequestURI() returns the path completely
undecoded, i.e. it applies neither URL decoding nor character decoding. In
example 1 this is what I expected, but in example 2 the result is
that getRequestURI() returns a String containing undecoded binary.
I would expect a String to have been converted to the appropriate
character set, otherwise the method should return a byte[].
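
As a rough illustration of what "a String containing undecoded binary" means in example 2, 
here is a minimal sketch of how an application could recover the intended text. It rests 
on an assumption that I have not verified against the Tomcat sources: that the container 
builds the String by mapping every raw byte to the char with the same numeric value (which 
is what ISO-8859-1 decoding does). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not a Tomcat API).
class RawUriRepair {

    // ASSUMPTION: each char in rawFromContainer holds exactly one original
    // request byte (byte value == char value, i.e. ISO-8859-1 mapping).
    // Under that assumption the bytes can be recovered without loss and
    // then decoded with the charset the client really used (UTF-8 here).
    static String redecodeAsUtf8(String rawFromContainer) {
        byte[] originalBytes = rawFromContainer.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulates example 2: "/" followed by the raw bytes 0xC3 0xA4,
        // each stored in its own char.
        String simulated = "/\u00C3\u00A4";
        String repaired = redecodeAsUtf8(simulated);

        System.out.println(simulated.length()); // 3 chars before
        System.out.println(repaired.length());  // 2 chars after: "/" and U+00E4
    }
}
```

If the container used some other byte-to-char mapping, this round trip would not be 
lossless, which is another way of stating why a byte[] would be the more honest return 
type.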

Internally Tomcat deals with both these examples, as we can see from
getPathInfo() always returning the correctly decoded path, so I guess 
this issue is all about how to interpret the servlet specification. 

The servlet 3.0 pdf doesn't give any details on the getRequestURI() 
method, so the only clue I can find is the getRequestURI() javadoc 
text:
  "The web container does not decode this String."
but the examples given in javadoc only illustrate the removal of the
query string and don't go into any kind of encoding.

So the question is whether the javadoc "does not decode" text:
- only applies to URL-encoding (so non-URL-encoded values should
  go through character set decoding)
- or, applies also when only character encoding is used (in which 
  case I think the specification has a bug, as getRequestURI() 
  then should return byte[])
?

[Naturally, not doing URL-decoding also means that the underlying
character encoding remains untouched. The "bug" here is when only
character encoding is present. For example, this appears in some mod_jk
configurations.]

Best regards
Mike

Christopher Schultz wrote:
> Mike,
> 
> On 2/14/13 9:51 AM, Mike Wilson wrote:
> > I can see that even if you specify URIEncoding=UTF-8 in
> > server.xml, calls to HttpServletRequest.getRequestURI() will still
> > return an undecoded String. (This is probably because of the
> > "specification text" in javadoc: "The web container does not decode
> > this String.")
> > 
> > My question is if this behaviour has changed throughout Tomcat 
> > versions?
> > 
> > We got problems with this when upgrading to Tomcat 7, and it seems 
> > we have been getting decoded strings previously when we were using 
> > Jboss 4 (based on Tomcat 5.5 IIRC).
> 
> I think you may be confusing character encoding versus URL encoding.
> The <Connector>'s URIEncoding is a character encoding (e.g.
> ISO-8859-1, UTF-8, etc.) that will be used to convert bytes into
> characters while URL encoding is the transformation of characters like
> "+" into spaces, %-decoding, etc.
> 
> What kind of encoding is (or isn't) happening that seems surprising?
> 
> -chris




Re: getRequestURI() in relation to Connector.URIEncoding

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Mike,

On 2/14/13 9:51 AM, Mike Wilson wrote:
> I can see that even if you specify URIEncoding=UTF-8 in
> server.xml, calls to HttpServletRequest.getRequestURI() will still
> return an undecoded String. (This is probably because of the
> "specification text" in javadoc: "The web container does not decode
> this String.")
> 
> My question is if this behaviour has changed throughout Tomcat 
> versions?
> 
> We got problems with this when upgrading to Tomcat 7, and it seems 
> we have been getting decoded strings previously when we were using 
> Jboss 4 (based on Tomcat 5.5 IIRC).

I think you may be confusing character encoding versus URL encoding.
The <Connector>'s URIEncoding is a character encoding (e.g.
ISO-8859-1, UTF-8, etc.) that will be used to convert bytes into
characters while URL encoding is the transformation of characters like
"+" into spaces, %-decoding, etc.

What kind of encoding is (or isn't) happening that seems surprising?

-chris
