
Posted to users@tomcat.apache.org by asbachb <mz...@gmail.com> on 2011/01/27 01:15:24 UTC

ServletWebRequest.getServletPath() returns strange values on uris with german umlauts

Hello,

Environment: Tomcat 7.0.6 - Java 1.6_22 - Ubuntu 10.10

I have a problem with a Wicket application which places parameters into the
path of the URI. When getServletPath() is called on URIs containing
umlauts, the method returns a wrongly encoded path.
In my sample application the method returns "/page/param/vÃ¤lue-xxx" but I
would expect "/page/param/välue-xxx".

As mentioned in the Character Encoding FAQ, I have already set the URIEncoding and
useBodyEncodingForURI attributes in my server.xml configuration.

Feel free to have a look at my sample application attached.

http://old.nabble.com/file/p30770590/wicket-umlauts-1.0-SNAPSHOT.war
wicket-umlauts-1.0-SNAPSHOT.war 

I also tested the same application in the version bundled with NetBeans (6.0.29)
and in the latest GlassFish version; both worked without problems.

Any ideas how to solve that issue?

Kind Regards,
Benjamin
-- 
View this message in context: http://old.nabble.com/ServletWebRequest.getServletPath%28%29-returns-strange-values-on-uris-with-german-umlauts-tp30770590p30770590.html
Sent from the Tomcat - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

Re: ServletWebRequest.getServletPath() returns strange values on uris with german umlauts

Posted by André Warnier <aw...@ice-sa.com>.

asbachb wrote:
> Thanks for your reply.
> 
> I checked my client's request to Tomcat, which shows that the umlauts are
> correctly replaced with their entities:
> 
> GET
> "http://localhost:8080/wicket-umlauts-1.0-SNAPSHOT/page/param/v%C3%A4lue-xxx"
> 
> This request should be a valid ASCII request and shouldn't be a problem to
> decode?
> 
> 
I understand what you mean, and you are right, but in a case like this you have to be very 
careful with your vocabulary.
The term "ASCII" is usually reserved for a character set (or alphabet) of only 128 codes, 
each represented by one byte per character, in which the only letters are A-Z and a-z.  
Basically, the English alphabet.
An "umlaut" is a diacritic mark.
A "lowercase a with umlaut" is a letter of the German alphabet (and probably others).
The term "entity" is usually used in the context of XML or HTML, to denote something of 
the form "&xxx;" where "xxx" represents the name of a symbol.
And "/wicket-umlauts-1.0-SNAPSHOT/page/param/v%C3%A4lue-xxx" seems to be the result of 2 
consecutive steps :
a) the client composes a URL as a Unicode String, and encodes it using the UTF-8 encoding
b) after (a), it scans this URL for any byte/character that is not valid in a URL (as per 
RFC 2396) and "URL-encodes" it, which consists of replacing the offending byte by its 
encoding as "%xy", where "xy" is the hexadecimal representation of the byte value.
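
Steps (a) and (b) can be sketched roughly as follows. This is only an illustration: the 
class and method names are made up, the set of characters left untouched is the 
"unreserved" set of RFC 2396 plus "/", and real clients may differ in the details.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating steps (a) and (b); not part of Tomcat or Wicket.
class PercentEncodeSketch {

    // Characters copied unchanged: RFC 2396 "unreserved" characters plus '/'
    // (kept here as the path separator, which is an assumption of this sketch).
    private static final String SAFE =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.!~*'()/";

    static String encodePath(String path) {
        // Step (a): turn the Unicode string into bytes using UTF-8.
        byte[] bytes = path.getBytes(StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value < 0x80 && SAFE.indexOf((char) value) >= 0) {
                out.append((char) value);
            } else {
                // Step (b): replace the offending byte by "%xy" in hexadecimal.
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e4 is "lowercase a with umlaut"
        System.out.println(encodePath("/page/param/v\u00e4lue-xxx"));
        // prints /page/param/v%C3%A4lue-xxx
    }
}
```

The single letter becomes two bytes in UTF-8, hence the two "%xy" groups in the request.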

The server, when it receives this request,
c) "URL-decodes" the URL, replacing each "%xy" sequence by the corresponding single-byte code
d) and then, it depends..
If you have told the server to decode the URL (after (c)) as if it was UTF-8/Unicode, then 
the server will do that, to generate an internal Java Unicode String.

This is not the default.  You have to tell the server to do that.  With Tomcat, you do 
that with the 'URIEncoding="UTF-8"' attribute of the Connector.
(You cannot in this case use the "useBodyEncodingForURI" attribute, because for a GET 
request, there is no body (and thus no body encoding of course)).
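
The effect of the charset chosen in step (d) can be reproduced outside the server. The 
sketch below uses java.net.URLDecoder as a stand-in for the server's decoding (it is not 
Tomcat's own code, and the class name is made up). It decodes the same request path once as 
ISO-8859-1, which as far as I know is what Tomcat 7 uses when URIEncoding is not set, and 
once as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Stand-in for server-side steps (c) and (d); this is NOT Tomcat's own code.
class DecodeCharsetSketch {

    static String decode(String encoded, String charsetName) {
        try {
            // URLDecoder turns each "%xy" into a byte, then decodes the bytes
            // using the given charset.
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String path = "/page/param/v%C3%A4lue-xxx";
        // Two characters (U+00C3, U+00A4) where one letter was meant:
        System.out.println(decode(path, "ISO-8859-1"));
        // The intended single letter (U+00E4):
        System.out.println(decode(path, "UTF-8"));
    }
}
```

The first result matches the garbled value reported at the start of this thread, which 
suggests that somewhere the bytes are being read as ISO-8859-1.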

If you have done that, and your application asks Tomcat for the URL String directly, then 
you should get the correct Java (Unicode) String in response.
(You should be able to check this easily with a simple JSP page).

Now if you get this path via a call specific to the "wicket" application you are using, 
then you have to check in that application what happens, to make the result different.
Maybe this "wicket" thing does its own decoding of the path, resulting in a (wrong) 
double-decoding ?
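
To narrow down where the damage happens, one possible diagnostic (an assumption on my 
part, not something taken from Wicket or Tomcat) is to take the garbled string, turn it back 
into bytes as ISO-8859-1 and read those bytes as UTF-8. If that yields the expected path, 
the bytes arrived intact and were merely decoded with the wrong charset at some layer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; useful for investigation, not as a permanent fix.
class MojibakeCheckSketch {

    static String reinterpretAsUtf8(String garbled) {
        // ISO-8859-1 maps each character U+0000..U+00FF back to the same byte value,
        // so this recovers the original bytes of a string that was decoded as ISO-8859-1.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes the way the client intended.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "v\u00c3\u00a4lue-xxx" is the garbled value: 'v', U+00C3, U+00A4, "lue-xxx"
        String repaired = reinterpretAsUtf8("/page/param/v\u00c3\u00a4lue-xxx");
        System.out.println(repaired.equals("/page/param/v\u00e4lue-xxx")); // prints true
    }
}
```

Logging both request.getServletPath() and the framework's value next to this check should 
show which layer applies the wrong charset.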


Re: ServletWebRequest.getServletPath() returns strange values on uris with german umlauts

Posted by asbachb <mz...@gmail.com>.

Thanks for your reply.

I checked my client's request to Tomcat, which shows that the umlauts are
correctly replaced with their entities:

GET
"http://localhost:8080/wicket-umlauts-1.0-SNAPSHOT/page/param/v%C3%A4lue-xxx"

This request should be a valid ASCII request and shouldn't be a problem to
decode?




-- 
View this message in context: http://old.nabble.com/ServletWebRequest.getServletPath%28%29-returns-strange-values-on-uris-with-german-umlauts-tp30770590p30778001.html
Sent from the Tomcat - User mailing list archive at Nabble.com.



Re: ServletWebRequest.getServletPath() returns strange values on uris with german umlauts

Posted by André Warnier <aw...@ice-sa.com>.

asbachb wrote:
> Thank you for your reply.
> 
> Sorry for expressing myself a little vaguely. I meant that I have already tried both
> attributes. 
> 
> My encoding is UTF-8.
> 
> Here are the missing sources:
> 
> http://old.nabble.com/file/p30775449/wicket-umlauts.zip wicket-umlauts.zip 
> 
> 
> Konstantin Kolinko wrote:
>> 2011/1/27 asbachb <mz...@gmail.com>:
>>> As mentioned in the Character Encoding FAQ, I have already set the URIEncoding
>>> and useBodyEncodingForURI attributes in my server.xml configuration.
>> URIEncoding and useBodyEncodingForURI  are alternatives. Do not use
>> both at the same time.  My understanding of /docs/config/http.html is
>> that useBodyEncodingForURI overrides URIEncoding. So, what is your
>> "body encoding" in this case?
>>
>>
Maybe the first thing to remember is this :
http://tools.ietf.org/html/rfc2396
Section : 2.1 URI and non-ASCII characters

In other words: a URI /does not/ have any specific encoding, nor is there any way in the 
HTTP protocol of specifying one.
So whatever you do in terms of interpreting this URI depends on an agreement between the 
client and the server.
/If/ you can be sure that all the clients accessing your application will always encode the 
URI of their request using charset/encoding XYZ, /then/ you can decide to decode this URI 
on the server side using the same charset/encoding.
Otherwise, well, you have a problem.
And within the limitations of the current HTTP protocol, that problem cannot be solved 
entirely.

In other words also, the Tomcat attributes useBodyEncodingForURI and URIEncoding are a way 
for you to influence how the server side will decode the URIs, but they cannot do 
anything about how the clients are encoding them.
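
A small illustration of why the server cannot tell on its own: the same letter produces 
different "%xy" sequences depending on the charset the client picked. The sketch uses 
java.net.URLEncoder purely as a convenient encoder (it follows form-encoding rules, so it 
is only an approximation of what a browser does with a path), and the class name is made 
up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo of client-side encoding under two different charsets.
class ClientCharsetSketch {

    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String letter = "\u00e4"; // lowercase a with umlaut
        System.out.println(encode(letter, "ISO-8859-1")); // prints %E4
        System.out.println(encode(letter, "UTF-8"));      // prints %C3%A4
    }
}
```

Both are valid requests on the wire; only the agreement between client and server says 
which charset the bytes should be read with.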


Re: ServletWebRequest.getServletPath() returns strange values on uris with german umlauts

Posted by asbachb <mz...@gmail.com>.

Thank you for your reply.

Sorry for expressing myself a little vaguely. I meant that I have already tried both
attributes. 

My encoding is UTF-8.

Here are the missing sources:

http://old.nabble.com/file/p30775449/wicket-umlauts.zip wicket-umlauts.zip 



-- 
View this message in context: http://old.nabble.com/ServletWebRequest.getServletPath%28%29-returns-strange-values-on-uris-with-german-umlauts-tp30770590p30775449.html
Sent from the Tomcat - User mailing list archive at Nabble.com.



Re: ServletWebRequest.getServletPath() returns strange values on uris with german umlauts

Posted by Konstantin Kolinko <kn...@gmail.com>.

2011/1/27 asbachb <mz...@gmail.com>:
> As mentioned in the Character Encoding FAQ, I have already set the URIEncoding and
> useBodyEncodingForURI attributes in my server.xml configuration.

URIEncoding and useBodyEncodingForURI  are alternatives. Do not use
both at the same time.  My understanding of /docs/config/http.html is
that useBodyEncodingForURI overrides URIEncoding. So, what is your
"body encoding" in this case?


> wicket-umlauts-1.0-SNAPSHOT.war

Where is its source code?

> META-INF/context.xml
> <Context antiJARLocking="true" path="/wicket-umlauts-1.0-SNAPSHOT"/>

Do not use "path" in context.xml that is inside the webapp. That
attribute is ignored there.


Best regards,
Konstantin Kolinko
