Posted to users@tomcat.apache.org by Tobias Brennecke <tb...@headissue.com> on 2017/04/21 16:27:41 UTC

+ character is always encoded in url path after request dispatch since #59317

Dear list,

in https://bz.apache.org/bugzilla/show_bug.cgi?id=59317
HttpServletRequest.getRequestURI() has been changed for Tomcat 7.0.70
onwards to always return an encoded URI, which matches the servlet 3.0
specification. However the encoding for the path component of the url
seems to be incorrect, so I wanted to raise the issue on the mailing
list first before opening a bug ticket. I could not find any other
related ticket on the bug tracker or any newer discussion in the mailing
list archives since the problem was fixed.

My apologies if this mail is a bit lengthy, but please bear with me as I
want to provide a thorough problem description.

The dispatchersUseEncodedPaths context attribute was introduced in
ticket #59317 to allow reverting to the "old" behavior. This still seems
to be broken, as + characters in dispatched URIs are encoded regardless
of whether the value is set to "true" or "false". (That is, the + is not
kept literally.)
Please note that "+" is a perfectly valid character in the path
component of a URL and has no special meaning there (unlike in a query
string such as ?foo=bar+baz, where it stands for a space). For instance,
Google uses it as a literal character in https://plus.google.com/+Google
while https://plus.google.com/%20Google returns a 404.
I will return to the details of whether + is a valid character in a URL
path further below.

= Problem statement =
We are using + characters in URLs like
https://www.example.com/myservlet/url+with+spaces/sub+url.html, which are
handled by an HttpServlet.
For each of these URLs there is also a prefixed version for partners, e.g.
https://example.com/prefix/myservlet/url+with+spaces/sub+url.html

Now if such a prefix is encountered, it gets removed by a servlet Filter
and the request is dispatched to the URL without the prefix, e.g.
/prefix/myservlet/url+with+spaces/sub+url.html is dispatched to
/myservlet/url+with+spaces/sub+url.html, which in turn is handled by the
HttpServlet.
(That is in a Filter:
request.getRequestDispatcher("/myservlet/url+with+spaces/sub+url.html").forward(request,
response);)
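
As a rough illustration of the prefix handling described above, the path rewriting can be sketched without the servlet API. The class name PrefixStripper, the method stripPrefix and the "/prefix" value are hypothetical and only serve this example; in the real Filter the result would be passed to request.getRequestDispatcher(...) as in the snippet above.

```java
// Hypothetical helper; names and the "/prefix" value are illustrative only.
class PrefixStripper {

    /** Returns the path without the leading prefix, or the path unchanged if the prefix is absent. */
    static String stripPrefix(String path, String prefix) {
        if (path.equals(prefix)) {
            return "/";
        }
        // Match whole path segments only, so "/prefixed/x" is not treated as "/prefix" + "ed/x".
        if (path.startsWith(prefix + "/")) {
            return path.substring(prefix.length());
        }
        return path;
    }

    public static void main(String[] args) {
        String target = stripPrefix("/prefix/myservlet/url+with+spaces/sub+url.html", "/prefix");
        // In a Filter, this value would be handed to
        // request.getRequestDispatcher(target).forward(request, response).
        System.out.println(target); // prints /myservlet/url+with+spaces/sub+url.html
    }
}
```

Note that the "+" characters pass through this step untouched; any change to them happens later, during the dispatch itself.
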
Now when calling HttpServletRequest.getRequestURI() in the Servlet, the
return values are as follows:

For Tomcat <= 7.0.69:
Calling the url directly: /myservlet/url+with+spaces/sub+url.html
Calling the url with a prefix: /myservlet/url+with+spaces/sub+url.html

Since 7.0.70 the return value of request.getRequestURI() in the
Servlet is inconsistent:
Calling the URL directly: /myservlet/url+with+spaces/sub+url.html

Now depending on the value of dispatchersUseEncodedPaths:
Calling the prefixed URL and "false" (Note the %2B instead of +):
/myservlet/url%2Bwith%2Bspaces/sub%2Burl.html
Calling the prefixed URL and "true" (+ is replaced by %20):
/myservlet/url%20with%20spaces/sub%20url.html

In either case, the value does not match the one returned when the URL is
called directly and, worse, the default behavior yields a path that is not
equivalent to the original URL.
The expected behavior is that the "+" is neither encoded as %2B (for
"false") nor replaced by %20 (for "true"); it should not be encoded at all.
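
To make the difference between the three observed values concrete, here is a small sketch that uses only java.net.URI from the JDK (not Tomcat code) to percent-decode each path. The class name DispatchedUriForms is made up for this example.

```java
import java.net.URI;

// Illustrative only: compares the three request URIs quoted above after percent-decoding.
class DispatchedUriForms {

    /** Percent-decodes the path of a URI reference; java.net.URI leaves a literal "+" as is. */
    static String decodedPath(String requestUri) {
        return URI.create(requestUri).getPath();
    }

    public static void main(String[] args) {
        String direct = "/myservlet/url+with+spaces/sub+url.html";
        String encodedPlus = "/myservlet/url%2Bwith%2Bspaces/sub%2Burl.html";
        String encodedSpace = "/myservlet/url%20with%20spaces/sub%20url.html";

        // The %2B form decodes to the same path, but the raw strings differ.
        System.out.println(decodedPath(encodedPlus).equals(decodedPath(direct)));  // prints true
        System.out.println(encodedPlus.equals(direct));                            // prints false
        // The %20 form decodes to a different path (spaces instead of plus signs).
        System.out.println(decodedPath(encodedSpace).equals(decodedPath(direct))); // prints false
    }
}
```

So the "false" variant at least names the same resource after decoding, while the "true" variant names a different one.
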

The reason is that in the catalina URLEncoder.DEFAULT at
https://github.com/apache/tomcat/blob/trunk/java/org/apache/catalina/util/URLEncoder.java
"+" is not in the list of safe characters.
As URLEncoder.DEFAULT is used throughout the changeset for bug #59317
(mentioned at the beginning of this mail), "+" characters will
always be encoded. See
https://github.com/apache/tomcat/commit/eb195bebac8239b994fa921aeedb136a93e4ccaf#diff-8b91a9296e19012bf6be4bdf975fab0d
for details.
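
The effect of a missing safe character can be shown with a simplified stand-in. The following is NOT Tomcat's URLEncoder; the class SafeCharEncoder and its set of safe characters are assumptions made for this sketch, which only demonstrates how an encoder driven by a safe-character set treats "+".

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Simplified, hypothetical stand-in for a safe-character based encoder; not Tomcat's actual code.
class SafeCharEncoder {
    private final BitSet safe = new BitSet(128);

    /** Letters, digits and "-_.~/" are always safe here; extraSafeChars adds more (ASCII only). */
    SafeCharEncoder(String extraSafeChars) {
        for (char c = 'a'; c <= 'z'; c++) safe.set(c);
        for (char c = 'A'; c <= 'Z'; c++) safe.set(c);
        for (char c = '0'; c <= '9'; c++) safe.set(c);
        for (char c : "-_.~/".toCharArray()) safe.set(c);
        for (char c : extraSafeChars.toCharArray()) safe.set(c);
    }

    /** Percent-encodes every UTF-8 byte that is not in the safe set. */
    String encode(String path) {
        StringBuilder out = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 128 && safe.get(v)) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String path = "/myservlet/url+with+spaces/sub+url.html";
        // Without "+" in the safe set it becomes %2B, as observed for "false".
        System.out.println(new SafeCharEncoder("").encode(path));
        // With "+" in the safe set the path is returned unchanged.
        System.out.println(new SafeCharEncoder("+").encode(path));
    }
}
```
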

= On the validity of "+" in URLs =
An HTTP URL typically consists of a protocol, host, path and query. Let's
focus on the last two: for /foo+bar?baz=a+b the path is /foo+bar and the
query is baz=a+b.
While in the query string the + character has a special meaning as a
space, this is not the case for the path, where it is just a regular
character. Although the encodings of path and query string are somewhat
similar, they are NOT the same!
The query is encoded as application/x-www-form-urlencoded, but the path
is not.
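
The JDK itself reflects this distinction, which the following sketch illustrates for the example URL above: java.net.URI performs plain percent-decoding on the path, while java.net.URLDecoder implements form decoding and turns "+" into a space. The class name PathVersusQuery is made up for this example, and decoding the whole query string in one go is a simplification (a real parser would split parameters first).

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Illustrative only: contrasts path decoding with form (query string) decoding.
class PathVersusQuery {

    /** Percent-decodes the path component; a literal "+" stays a "+". */
    static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    /** Form-decodes the raw query string; here "+" is interpreted as a space. */
    static String formDecodedQuery(String uri) {
        try {
            return URLDecoder.decode(URI.create(uri).getRawQuery(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String uri = "/foo+bar?baz=a+b";
        System.out.println(decodedPath(uri));      // prints /foo+bar
        System.out.println(formDecodedQuery(uri)); // prints baz=a b
    }
}
```
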
 
= See also =
Question on stack overflow:
stackoverflow.com/questions/1005676/urls-and-plus-signs
Blog Post listing valid characters in URI components, see section "The
reserved characters are different for each part":
https://web-beta.archive.org/web/20150509184317/http://blog.lunatech.com:80/2009/02/03/what-every-web-developer-must-know-about-url-encoding

Relevant RFCs:
https://tools.ietf.org/html/rfc3986#section-2.2
https://tools.ietf.org/html/rfc3986#section-2.3
Note that the set of reserved characters is different for each scheme
and URI component as also stated in the blog post above.

Definition of the HTTP URI scheme in RFC 7230 (sections 2.7.1/2.7.3, p.
17ff):
https://tools.ietf.org/html/rfc7230

To my knowledge there is no place in the above RFCs stating that a +
must be encoded in the path component of a URI or that it has a special
meaning there (unlike in query strings).

Follow-up discussion after #59317 was fixed:
http://marc.info/?l=tomcat-user&m=146800805502015

= How do other servlet containers handle this? =
For Jetty I found the following issue:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=435017

= Reproducing the problem =
I created the following Gist (to keep this mail shorter):
https://gist.github.com/tburny/468e635c176752f21251fc641450594d
I ran this with Tomcat 7.0.69 and 7.0.77, but I would assume that all
versions affected by #59317 are also affected by the behavior I described.

My question is whether this behavior is intended or if this is a bug.

As I'm a native German speaker, I apologize for any grammar mistakes or
misspellings. Thank you for your efforts and patience while reading this
mail.


Kind regards,

Tobias Brennecke


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: + character is always encoded in url path after request dispatch since #59317

Posted by Mark Thomas <ma...@apache.org>.
On 24/04/17 15:57, Tobias Brennecke wrote:
> On 24/04/17 14:24, Mark Thomas wrote:
> 
>>
>>> That looks like a bug to me.
>>>
>>> We need to go through and check all the places URLs are being encoded
>>> and check it is being done correctly.
>> This is now complete for 9.0.x, 8.5.x, 8.0.x and 7.0.x.
>>
>>> I am concerned that fixing this may break apps that rely on the broken
>>> behaviour. We might need to provide some configuration to work around this.
>> I haven't provided any configuration options yet. Any issues will be
>> considered on a case by case basis.
>>
>> Mark
> 
> Hi Mark,
> 
> thanks a lot for looking into this! I already saw you added this to the
> (unreleased) changelog.
> Just let me know if you could use some help, if there's still anything
> to do?
> Maybe with a patch or by improving unit tests and documentation?

Hi Tobias,

The Tomcat community always welcomes additional help.

Additional unit tests would always be welcome. Generally, a new unit
test should aim to improve (however slightly) the code coverage [1].

In terms of where to start looking, URLEncoder is pretty well covered.
It looks like it is the error cases that need testing. In terms of the
other functionality, the test I had to fix after applying this patch [2]
might give you some ideas.

If there are areas of documentation you think could be improved then
patches for those changes would also be very welcome.

Equally, please take the above as suggestions rather than as an
instruction. Anywhere you'd like to start contributing would be great.

Kind regards,

Mark

[1] https://ci.apache.org/projects/tomcat/tomcat9/coverage/
[2] http://svn.apache.org/viewvc?rev=1792468&view=rev

> 
> 
> 
> Many regards,
> 
> Tobias


Re: + character is always encoded in url path after request dispatch since #59317

Posted by Tobias Brennecke <tb...@headissue.com>.
On 24/04/17 14:24, Mark Thomas wrote:

>
>> That looks like a bug to me.
>>
>> We need to go through and check all the places URLs are being encoded
>> and check it is being done correctly.
> This is now complete for 9.0.x, 8.5.x, 8.0.x and 7.0.x.
>
>> I am concerned that fixing this may break apps that rely on the broken
>> behaviour. We might need to provide some configuration to work around this.
> I haven't provided any configuration options yet. Any issues will be
> considered on a case by case basis.
>
> Mark

Hi Mark,

thanks a lot for looking into this! I already saw you added this to the
(unreleased) changelog.
Just let me know if there is still anything to do and you could use some
help, maybe with a patch or by improving unit tests and documentation.



Many regards,

Tobias


RE: + character is always encoded in url path after request dispatch since #59317

Posted by Naga Ramesh <na...@manthan.com>.
Thanks for providing the information. Please confirm which details are required from my end.

We have repeatedly faced the same issue with the same Tomcat revision. Please find the attached e-mail for your reference:

Regards,
Naga Ramesh R
9972229728

-----Original Message-----
From: Mark Thomas [mailto:markt@apache.org] 
Sent: Monday, April 24, 2017 5:54 PM
To: Tomcat Users List
Subject: Re: + character is always encoded in url path after request dispatch since #59317

On 21/04/17 20:54, Mark Thomas wrote:
> On 21/04/17 17:27, Tobias Brennecke wrote:
> 
> <snip/>
> 
>> My question is whether this behavior is intended or if this is a bug.
> 
> That looks like a bug to me.
> 
> We need to go through and check all the places URLs are being encoded
> and check it is being done correctly.

This is now complete for 9.0.x, 8.5.x, 8.0.x and 7.0.x.

> I am concerned that fixing this may break apps that rely on the broken 
> behaviour. We might need to provide some configuration to work around this.

I haven't provided any configuration options yet. Any issues will be considered on a case by case basis.

Mark


> 
>> As I'm a native German speaker, I apologize for any grammar mistakes 
>> or misspellings. Thank you for your efforts and patience while 
>> reading this mail.
> 
> No need to apologise. I had no idea you weren't a native English 
> speaker (I am) until I read that last sentence.
> 
> Kind regards,
> 
> Mark

Re: + character is always encoded in url path after request dispatch since #59317

Posted by Mark Thomas <ma...@apache.org>.
On 21/04/17 20:54, Mark Thomas wrote:
> On 21/04/17 17:27, Tobias Brennecke wrote:
> 
> <snip/>
> 
>> My question is whether this behavior is intended or if this is a bug.
> 
> That looks like a bug to me.
> 
> We need to go through and check all the places URLs are being encoded
> and check it is being done correctly.

This is now complete for 9.0.x, 8.5.x, 8.0.x and 7.0.x.

> I am concerned that fixing this may break apps that rely on the broken
> behaviour. We might need to provide some configuration to work around this.

I haven't provided any configuration options yet. Any issues will be
considered on a case by case basis.

Mark


> 
>> As I'm a native German speaker, I apologize for any grammar mistakes or
>> misspellings. Thank you for your efforts and patience while reading this
>> mail.
> 
> No need to apologise. I had no idea you weren't a native English speaker
> (I am) until I read that last sentence.
> 
> Kind regards,
> 
> Mark


Re: + character is always encoded in url path after request dispatch since #59317

Posted by Mark Thomas <ma...@apache.org>.
On 21/04/17 17:27, Tobias Brennecke wrote:

<snip/>

> My question is whether this behavior is intended or if this is a bug.

That looks like a bug to me.

We need to go through and check all the places URLs are being encoded
and check it is being done correctly.

I am concerned that fixing this may break apps that rely on the broken
behaviour. We might need to provide some configuration to work around this.

> As I'm a native German speaker, I apologize for any grammar mistakes or
> misspellings. Thank you for your efforts and patience while reading this
> mail.

No need to apologise. I had no idea you weren't a native English speaker
(I am) until I read that last sentence.

Kind regards,

Mark

