Posted to users@tomcat.apache.org by André Warnier <aw...@ice-sa.com> on 2011/12/04 22:57:13 UTC

Character set issue

Hi.

I need help with a problem on a Tomcat system.  The system is hard to reach, and I 
cannot access it directly right now (this is Sunday night in Europe).
I know that the system runs Tomcat 6.something, under Oracle/Sun Java 1.6, and that's all 
I can say right now. The platform is RedHat RHEL, current version.

The problem which happens is that, after the update of a webapp (of which I do not have 
the code), it seems that non-US-English "diacritic" characters posted to the webapp from a 
web <form>, are now "corrupted". And I would like to understand better the Tomcat 
mechanism for reading HTTP request form parameters, so that I can start to figure out what 
is going wrong.

The webapp consists of a single servlet, wrapped by two filters.
The application's web.xml defines the order as
filter1
filter2
servlet
with both filters processing all requests to the servlet.

"filter1" is a commercial product used on many Tomcat sites.
"filter2" is my own filter (and it is the only part of which I have the source code)
"servlet" is also a commercial product of which I do not have the code, and the one which 
has just been updated.

What I would like to know is : with a setup such as the above, how does Tomcat determine 
in which /character set/ the body of the POST will be read ?

For example :
Suppose that we have 2 html forms, form1 and form2.  Both forms are functionally 
identical, and contain a text input box named "name1".
The form form1 has an html declaration which specifies it as having the charset "iso-8859-1".
The form form2 has an html declaration which specifies it as having the charset "UTF-8".

The user, in the input box "name1" of each form, types the string "TÜV" (second character 
= uppercase U with umlaut) and then posts the form to the webapp.
The user browser is the same in all cases.

If the servlet executes a request.getParameter("name1"), what are the factors which can 
determine how it receives the value of this parameter ?

Or maybe my question should be : /can/ the servlet (or one of the filters) do anything 
that would cause the value of "name1" to /not/ be a correct Java "TÜV" string in the servlet ?

Additional information :
Only the servlet was updated.  Prior to that update, the application worked correctly. So 
I strongly suspect that it is the updated servlet which creates the problem.  But I'd like 
to understand /how/ it can create such a problem, and if for example something in filter1 
or filter2 could contribute to the problem, or not.
Filter1 is an authentication servlet filter, and as far as I know it only checks HTTP 
headers, and does not concern itself with the body of the request.  But I suppose that 
even the request body "passes through" this filter, and that it could presumably corrupt 
this body (although I would consider this unlikely right now).
Filter2 is my own filter (and I am not a Java expert).  This filter works at a number of 
installations (and also here, before this servlet update).  It subclasses the HTTP 
request, because it needs to add a HTTP header to the request, on-the-fly.  But the 
subclass only overrides the methods which have to do with the HTTP headers, and does not 
handle the body directly.

Any information or ideas welcome.



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Character set issue

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Konstantin,

On 12/6/11 12:06 PM, Konstantin Kolinko wrote:
> 2011/12/6 Christopher Schultz <ch...@christopherschultz.net>:
>> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
>> 
>> Konstantin,
>> 
>> On 12/6/11 11:15 AM, Konstantin Kolinko wrote:
>>> 1. I do not use valves that call getParameter(), so I have not 
>>> seen the need, but the FormAuthenticator will need the
>>> feature?
>> 
>> ExtendedAccessLogValve can also cause the query string to be
>> parsed if "x-P(XXX)" is specified, form authentication will
>> certainly call getParameter, etc.
> 
> The logging happens when request processing is already done. So 
> AccessLogValve can rely on a Filter in most cases.

Good call.

- -chris


Re: Character set issue

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/12/6 Christopher Schultz <ch...@christopherschultz.net>:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Konstantin,
>
> On 12/6/11 11:15 AM, Konstantin Kolinko wrote:
>> 1. I do not use valves that call getParameter(), so I have not
>> seen the need, but the FormAuthenticator will need the feature?
>
> ExtendedAccessLogValve can also cause the query string to be parsed if
> "x-P(XXX)" is specified, form authentication will certainly call
> getParameter, etc.

The logging happens when request processing is already done. So
AccessLogValve can rely on a Filter in most cases.

Best regards,
Konstantin Kolinko



Re: Character set issue

Posted by Christopher Schultz <ch...@christopherschultz.net>.

André,

On 12/6/11 3:09 PM, André Warnier wrote:
> In fact, does anyone in this Tomcat world really know exactly why
> no standardisation committee or group of experts has yet come up
> with an RFC for HTTP 2.0 and an RFC for HTML 10.0 (or whatever the
> next major number is) where the default would be Unicode/UTF-8 for
> *everything* ?

There's no real compelling reason to go to HTTP 2.0 other than maybe
websockets and this particular character encoding issue. Since
everyone would say "oh, well, it's gotta be backward-compatible
anyway" and use HTTP 1.1 defaults for character encoding, it's not
worth it for the encoding issue, and websockets has its own discovery
capabilities so it's not really worth it for that, either.

> This question has been puzzling me for quite some time.
> 
> The amount of time web developers are spending unproductively
> handling these hairy questions of character encodings and
> translations is absolutely stupendous.

That's why I added the "What do I do to just make things work?"
question to the FAQ and suggested using UTF-8 for everything.

Apparently, we need a section called "Okay, I followed the 'make it
work' section and ... it's still not working" for cases like the one
you came up with: a sanity check for things like filters in the wrong
order, etc.

> One would think that rather than spending time inventing yet
> another round of servlet specs or html graphic extensions or
> sub-protocol of SOAP or punycode patch on DNS, someone would come
> up with this more fundamental thing, no ? What is it exactly that
> does not allow this to happen ?

Inertia :)

> Can anyone propose an RFC?

Yes, but I'm sure that the IETF doesn't assign every crazy idea an RFC:
http://www.rfc-editor.org/rfc-editor/instructions2authors.txt

> If yes, any interest by anyone here in participating in such a
> submission?

Meh. *shrugs*

- -chris


Re: Character set issue

Posted by André Warnier <aw...@ice-sa.com>.
Christopher Schultz wrote:
...
> 
> Honestly, the whole world should just set everything to UTF-8 and move
> on with life :)
> 
+5
In fact, does anyone in this Tomcat world really know exactly why no standardisation 
committee or group of experts has yet come up with an RFC for HTTP 2.0 and an RFC for HTML 
10.0 (or whatever the next major number is) where the default would be Unicode/UTF-8 for 
*everything* ?

This question has been puzzling me for quite some time.

The amount of time web developers are spending unproductively handling these hairy 
questions of character encodings and translations is absolutely stupendous. The amount of 
ultimately futile and resource-consuming code having to be written and run to deal with 
them is just as stupendous.
One would think that rather than spending time inventing yet another round of servlet 
specs or html graphic extensions or sub-protocol of SOAP or punycode patch on DNS, someone 
would come up with this more fundamental thing, no ?
What is it exactly that does not allow this to happen ?

Can anyone propose an RFC ? If yes, any interest by anyone here in participating in such a 
submission ?




Re: Character set issue

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Konstantin,

On 12/6/11 11:15 AM, Konstantin Kolinko wrote:
> 1. I do not use valves that call getParameter(), so I have not
> seen the need, but the FormAuthenticator will need the feature?

ExtendedAccessLogValve can also cause the query string to be parsed if
"x-P(XXX)" is specified, form authentication will certainly call
getParameter, etc.

> Anyone who has an itch can implement the valve.

Of course :)

> 2. I sometimes wonder whether URIEncoding setting on a Connector
> can be moved to a Context instead.

Honestly, the whole world should just set everything to UTF-8 and move
on with life :)

But you're right: this has little to do with the connector. It just
has to go *somewhere*.

> 3. Maybe backport the move of SetCharacterEncodingFilter to 6.0.

+1

- -chris


Re: Character set issue

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/12/6 Christopher Schultz <ch...@christopherschultz.net>:
> On 12/4/11 8:02 PM, Konstantin Kolinko wrote:
>> Make sure that content type and charset value in a) Content-Type
>> HTTP header sent by server and b) in META tag in HTML text have
>> _literally_ the same value. If they both are present and they do
>> not match, odd things may happen in "non-compliant" browsers.
>
> I almost always do the following at the top of my pages (JSP shows,
> but you get the idea for any templating system):
>
> <html>
> <head>
>  <meta http-equiv="Content-Type"
>        content="text/html<%= null == response.getCharacterEncoding()
>                             ? ""
>                             : "; charset=" + response.getCharacterEncoding() %>"
>  />
>
> I do this so that, in case the response encoding gets changed from
> whatever I think it is, I don't report the wrong one. You don't want
> the page to say UTF-8 when the encoding is really Shift_JIS or
> something else.
>

:) I always do something similar, but with a slightly simpler implementation:
<%= response.getContentType() %>


2011/12/6 Christopher Schultz <ch...@christopherschultz.net>:
> On 12/5/11 6:53 PM, Konstantin Kolinko wrote:
>> Note, that there is standard "SetCharacterEncodingFilter" in Tomcat
>> 7. (In 7.0 it is in o.a.c.filters package, in 6.0 and 5.5 it is
>> examples webapp).
>
> I see that you've moved that out of examples and into the main code base.
>
> Should we also provide a Valve version of this? That way, you can make
> sure that the encoding is set before Valves like the
> AuthenticatorValve fire.
>

1. I do not use valves that call getParameter(), so I have not seen
the need, but the FormAuthenticator will need the feature?

Anyone who has an itch can implement the valve.


2. I sometimes wonder whether the URIEncoding setting on a Connector
could be moved to a Context instead.

3. Maybe backport the move of SetCharacterEncodingFilter to 6.0.

Best regards,
Konstantin Kolinko



Re: Character set issue

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Konstantin,

On 12/4/11 8:02 PM, Konstantin Kolinko wrote:
> Make sure that content type and charset value in a) Content-Type
> HTTP header sent by server and b) in META tag in HTML text have
> _literally_ the same value. If they both are present and they do 
> not match, odd things may happen in "non-compliant" browsers.

I almost always do the following at the top of my pages (JSP shows,
but you get the idea for any templating system):

<html>
<head>
  <meta http-equiv="Content-Type"
        content="text/html<%= null == response.getCharacterEncoding()
                             ? ""
                             : "; charset=" + response.getCharacterEncoding() %>"
  />

I do this so that, in case the response encoding gets changed from
whatever I think it is, I don't report the wrong one. You don't want
the page to say UTF-8 when the encoding is really Shift_JIS or
something else.

- -chris


Re: Character set issue

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/12/5 André Warnier <aw...@ice-sa.com>:
> Hi.
>
> I need help with a problem on a Tomcat system.  The system is of difficult
> access, and I cannot access it directly right now (this is Sunday night in
> Europe).
> I know that the system runs Tomcat 6.something, under Oracle/Sun Java 1.6,
> and that's all I can say right now. The platform is RedHat RHEL, current
> version.
>
> The problem which happens is that, after the update of a webapp (of which I
> do not have the code), it seems that non-US-English "diacritic" characters
> posted to the webapp from a web <form>, are now "corrupted". And I would
> like to understand better the Tomcat mechanism for reading HTTP request form
> parameters, so that I can start to figure out what is going wrong.
>
> The webapp consists of a single servlet, wrapped by two filters.
> The application's web.xml defines the order as
> filter1
> filter2
> servlet
> with both filters processing all requests to the servlet.
>
> "filter1" is a commercial product used on many Tomcat sites.
> "filter2" is my own filter (and it is the only part of which I have the
> source code)
> "servlet" is also a commercial product of which I do not have the code, and
> the one which has just been updated.
>
> What I would like to know is : with a setup such as the above, how does
> Tomcat determine in which /character set/ the body of the POST will be read
> ?
>
> For example :
> Suppose that we have 2 html forms, form1 and form2.  Both forms are
> functionally identical, and contain a text input box named "name1".
> The form form1 has an html declaration which specifies it as having the
> charset "iso-8859-1".
> The form form2 has an html declaration which specifies it as having the
> charset "UTF-8".
>
> The user, in the input box "name1" of each form, types the string "TÜV"
> (second character = uppercase U with umlaut) and then posts the form to the
> webapp.
> The user browser is the same in all cases.
>
> If the servlet executes a request.getParameter("name1"), what are the
> factors which can determine how it receives the value of this parameter ?
>
> Or maybe my question should be : /can/ the servlet (or one of the filters)
> do anything that would cause the value of "name1" to /not/ be a correct Java
> "TÜV" string in the servlet ?
>
> Additional information :
> Only the servlet was updated.  Prior to that update, the application worked
> correctly. So I strongly suspect that it is the updated servlet which
> creates the problem.  But I'd like to understand /how/ it can create such a
> problem, and if for example something in filter1 or filter2 could contribute
> to the problem, or not.
> Filter1 is an authentication servlet filter, and as far as I know it only
> checks HTTP headers, and does not concern itself with the body of the
> request.  But I suppose that even the request body "passes through" this
> filter, and that it could presumably corrupt this body (although I would
> consider this unlikely right now).
> Filter2 is my own filter (and I am not a Java expert).  This filter works at
> a number of installations (and also here, before this servlet update).  It
> subclasses the HTTP request, because it needs to add a HTTP header to the
> request, on-the-fly.  But the subclass only overrides the methods which have
> to do with the HTTP headers, and does not handle the body directly.
>
> Any information or ideas welcome.
>

1. I think you know the FAQ:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

2. Make sure that the web browser understands what character encoding
the web form uses.

Some browsers remember what encoding was used on the previous page and
use that instead of what is provided by server.
Mixing both ISO-8859-1 and UTF-8 forms on the same site is bad in this sense.

Make sure that content type and charset value in
 a) Content-Type HTTP header sent by server and
 b) in META tag in HTML text
have _literally_ the same value. If they both are present and they do
not match, odd things may happen in "non-compliant" browsers.

3. A servlet or JSP page called as "include" cannot change the content
type (and thus the charset). The <%@page contentType=".."%> directive
will be ignored.
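
As a rough illustration of the check in point 2 above (this is not part of
the original message), the two charset values can be compared mechanically.
The sketch below is plain Java. The names ContentTypeCheck, charsetOf and
sameCharset are made up for this example, and it assumes you have already
obtained the Content-Type header value and the content attribute of the META
tag as strings.

```java
import java.util.Locale;

/** Hypothetical helper, for illustration only. */
class ContentTypeCheck {

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    /** True only if both values carry a charset and the two are literally identical. */
    static boolean sameCharset(String headerValue, String metaContent) {
        String a = charsetOf(headerValue);
        String b = charsetOf(metaContent);
        return a != null && a.equals(b);
    }
}
```

The comparison is deliberately case-sensitive, following the "_literally_ the
same value" advice: "UTF-8" and "utf-8" are reported as a mismatch.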


Best regards,
Konstantin Kolinko



Re: Character set issue

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Konstantin,

On 12/5/11 6:53 PM, Konstantin Kolinko wrote:
> Note, that there is standard "SetCharacterEncodingFilter" in Tomcat
> 7. (In 7.0 it is in o.a.c.filters package, in 6.0 and 5.5 it is
> examples webapp).

I see that you've moved that out of examples and into the main code base.

Should we also provide a Valve version of this? That way, you can make
sure that the encoding is set before Valves like the
AuthenticatorValve fire.

- -chris



Re: Character set issue

Posted by Konstantin Kolinko <kn...@gmail.com>.
2011/12/6 André Warnier <aw...@ice-sa.com>:
> Marvin Addison wrote:
>>>
>>> /can/ the servlet (or one of the filters)
>>> do anything that would cause the value of "name1" to /not/ be a correct
>>> Java
>>> "TÜV" string in the servlet ?
>>
>>
>> Yes, absolutely.  If this is a posted value and some filter fires that
>> coerces the encoding (e.g. request.getParameter() in the case of POST)
>> of the request, all subsequent filters and the servlet will see the
>> string in the encoding of the first filter.  This is why it's
>> important to set the encoding as early in the servlet processing
>> pipeline as possible.
>
>
> Thank you for the answer.
>
>
>>
>> For your particular case it's hard to imagine an encoding in practice
>> that would make that string appear incorrectly.  Both iso-8859-1 and
>> utf-8 should handle Ü correctly.
>
>
> I don't think that's true.  A "Ü" in iso-8859-1 is a single byte (\xDC).  In
> Unicode/UTF-8 encoding, it is 2 bytes (\xC3 \x9C).  (The Unicode codepoint of
> "Ü" is 00DC (hex), but that's a different matter.)
>
> So if the servlet reads a parameter from the post, thinking the post is
> UTF-8 while it is really iso-8859-1, and this parameter is a "Ü", the
> servlet will read 2 bytes, getting \xDC and whichever byte follows it, and
> get garbage, because \xDC followed by any other byte is probably not valid
> UTF-8.
> On the other hand, if the servlet reads a parameter from the post, thinking
> the post is iso-8859-1 while it is really UTF-8, and this parameter is a
> "Ü", the servlet will read a single byte (\xC3), which will be converted to
> the Java Unicode character with codepoint 00C3 (hex), which is a capital A
> tilde (can't even type that on my German keyboard).
>
> In fact, this is what happens in reality :
>
> We have a html page, defined as being content-type="text/html;
> charset=UTF-8".
> It is saved as UTF-8, by a Unicode-savvy editor.
> It is received by the browser, and the browser (IE or Firefox) says that the
> document is UTF-8.
> The page contains a <form> tag, which contains an enctype="UTF-8" attribute.
> The form contains an input text box, in which the user types a "Ü" and then
> submits the form.
>
> In the normal configuration of the target webapp, there are
> filter1
> filter2
> servlet
> (in that order).
> servlet reads the post parameters and the servlet gets garbage instead of
> the Java string "Ü".
>
> If we remove filter1 and filter2, leaving servlet alone, then servlet reads
> the proper "Ü".
>
> If we reinstate filter1 and filter2, and in filter2 (the only piece of
> which I control the code) I add an early call to
> request.setCharacterEncoding("UTF-8");
> then servlet gets the correct string.
>
> Who is "responsible" for setting the request character set ? In my naive
> understanding, I thought that whenever a method call happens which requires
> parsing the request body, and if by that time the request encoding has not
> been set explicitly, it would be Tomcat code which would evaluate the
> circumstances and set the encoding appropriately.
> Such as :
> - default is iso-8859-1 (as per HTTP default)
> - but if the request somehow says otherwise (*), then whatever the request
> says.
>  ((*) which for a POST it should always do, no ?)
>
> Is that a wrong understanding ?
> (I read the Servlet Spec v 3.0, section 3.10, but I am still not sure)
>
> filter2 contains calls, in this order, to
> - config.getInitParameter
> - optionally, for testing : request.setCharacterEncoding("UTF-8")
> - request.getRequestURL
> - request.getQueryString
> - request.getRemoteAddr
> - request.getHeaderNames
> - request.getHeader
> - request.getAttributeNames
> .. and, finally, a
> - request.getParameter
>
> Is it then the responsibility of filter2 to set the request encoding ?
> Should the optional request.setCharacterEncoding become mandatory ?
> Should the request.setCharacterEncoding call be made just before the
> request.getParameter, or is there another earlier method call in the list
> above that can trigger the encoding to be already set ?
>

Parameter parsing happens once and is triggered by the first call
that requests the parameters.
That call is usually request.getParameter(), but getParameterValues(),
getParameterNames() and getParameterMap() trigger it in the same way.

At _that_ moment the conversion from bytes to Strings happens and the
request encoding must already be set.

It is the application's responsibility to set the request encoding. It
defaults to ISO-8859-1 if not set explicitly. (A charset value in the
Content-Type header of the request is taken into account if present,
but most browsers do not include a charset in their requests, so that
is usually irrelevant.)
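
The "parse once, on first access" behaviour described above can be modelled
in a few lines of plain Java. This is only a toy sketch and not Tomcat's
implementation: LazyFormRequest and its methods are hypothetical names, and
the body is assumed to be a simple application/x-www-form-urlencoded string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (hypothetical, not Tomcat code) of a request that parses its form body once. */
class LazyFormRequest {
    private final String rawBody;            // e.g. "name1=T%C3%9CV"
    private String encoding;                 // null until someone sets it
    private Map<String, String> parameters;  // null until the first parameter access

    LazyFormRequest(String rawBody) {
        this.rawBody = rawBody;
    }

    void setCharacterEncoding(String encoding) {
        // Only affects a parse that has not happened yet.
        this.encoding = encoding;
    }

    String getParameter(String name) {
        if (parameters == null) {
            parameters = parse();  // bytes become Strings here, exactly once
        }
        return parameters.get(name);
    }

    private Map<String, String> parse() {
        // Fall back to ISO-8859-1 when nobody set an encoding.
        String charset = (encoding != null) ? encoding : "ISO-8859-1";
        Map<String, String> result = new LinkedHashMap<String, String>();
        try {
            for (String pair : rawBody.split("&")) {
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                result.put(URLDecoder.decode(key, charset),
                           URLDecoder.decode(value, charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + charset, e);
        }
        return result;
    }
}
```

In this model, calling setCharacterEncoding("UTF-8") before the first
getParameter("name1") yields "TÜV" for the body "name1=T%C3%9CV". Calling it
after a first getParameter() changes nothing, because the values were already
decoded as ISO-8859-1. That matches the symptom where a filter that reads a
parameter early fixes the encoding for everything downstream.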

Note that there is a standard "SetCharacterEncodingFilter" in Tomcat 7.
(In 7.0 it is in the o.a.c.filters package; in 6.0 and 5.5 it is in the examples webapp.)

Once again,
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

Best regards,
Konstantin Kolinko



Re: Character set issue

Posted by André Warnier <aw...@ice-sa.com>.
Marvin Addison wrote:
>> /can/ the servlet (or one of the filters)
>> do anything that would cause the value of "name1" to /not/ be a correct Java
>> "TÜV" string in the servlet ?
> 
> Yes, absolutely.  If this is a posted value and some filter fires that
> coerces the encoding (e.g. request.getParameter() in the case of POST)
> of the request, all subsequent filters and the servlet will see the
> string in the encoding of the first filter.  This is why it's
> important to set the encoding as early in the servlet processing
> pipeline as possible.

Thank you for the answer.

> 
> For your particular case it's hard to imagine an encoding in practice
> that would make that string appear incorrectly.  Both iso-8859-1 and
> utf-8 should handle Ü correctly.

I don't think that's true.  A "Ü" in iso-8859-1 is a single byte (\xDC).  In Unicode/UTF-8 
encoding, it is 2 bytes (\xC3 \x9C).  (The Unicode codepoint of "Ü" is 00DC (hex), but that's 
a different matter.)

So if the servlet reads a parameter from the post, thinking the post is UTF-8 while it is 
really iso-8859-1, and this parameter is a "Ü", the servlet will read 2 bytes, getting 
\xDC and whichever byte follows it, and get garbage, because \xDC followed by any other 
byte is probably not valid UTF-8.
On the other hand, if the servlet reads a parameter from the post, thinking the post is 
iso-8859-1 while it is really UTF-8, and this parameter is a "Ü", the servlet will read a 
single byte (\xC3), which will be converted to the Java Unicode character with codepoint 
00C3 (hex), which is a capital A tilde (can't even type that on my German keyboard).
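
The byte-level reasoning above can be checked with a few lines of standalone
Java. This is just a sketch added for illustration: the class name
CharsetMismatchDemo and its helper methods are made up, and it assumes Java 7
or later for java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: what happens when sender and reader disagree on the charset. */
class CharsetMismatchDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset sentAs, Charset readAs) {
        return new String(text.getBytes(sentAs), readAs);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "T\u00DCV"; // "TÜV", written with an escape to avoid source-encoding issues

        System.out.println(hex(original.getBytes(StandardCharsets.ISO_8859_1))); // 54 DC 56
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));      // 54 C3 9C 56

        // UTF-8 bytes read as ISO-8859-1: each byte becomes one character,
        // so the result has 4 characters: T, U+00C3, U+009C, V.
        String asLatin1 = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.length());

        // ISO-8859-1 bytes read as UTF-8: 0xDC announces a two-byte sequence,
        // but 'V' is not a continuation byte, so the decoder substitutes U+FFFD.
        String asUtf8 = reinterpret(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(asUtf8.equals("T\uFFFDV"));
    }
}
```

Either mismatch loses the "Ü", but in different ways, which can help tell from
the garbage which side has the wrong idea about the encoding.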

In fact, this is what happens in reality :

We have a html page, defined as being content-type="text/html; charset=UTF-8".
It is saved as UTF-8, by a Unicode-savvy editor.
It is received by the browser, and the browser (IE or Firefox) says that the document is 
UTF-8.
The page contains a <form> tag, which contains an enctype="UTF-8" attribute.
The form contains an input text box, in which the user types a "Ü" and then submits the form.

In the normal configuration of the target webapp, there are
filter1
filter2
servlet
(in that order).
servlet reads the post parameters and the servlet gets garbage instead of the Java string "Ü".

If we remove filter1 and filter2, leaving servlet alone, then servlet reads the proper "Ü".

If we reinstate filter1 and filter2, and in filter2 (the only piece of which I control 
the code) I add an early call to
request.setCharacterEncoding("UTF-8");
then servlet gets the correct string.

Who is "responsible" for setting the request character set ? In my naive understanding, I 
thought that whenever a method call happens which requires parsing the request body, and 
if by that time the request encoding has not been set explicitly, it would be Tomcat code 
which would evaluate the circumstances and set the encoding appropriately.
Such as :
- default is iso-8859-1 (as per HTTP default)
- but if the request somehow says otherwise (*), then whatever the request says.
   ((*) which for a POST it should always do, no ?)

Is that a wrong understanding ?
(I read the Servlet Spec v 3.0, section 3.10, but I am still not sure)

filter2 contains calls, in this order, to
- config.getInitParameter
- optionally, for testing : request.setCharacterEncoding("UTF-8")
- request.getRequestURL
- request.getQueryString
- request.getRemoteAddr
- request.getHeaderNames
- request.getHeader
- request.getAttributeNames
.. and, finally, a
- request.getParameter

Is it then the responsibility of filter2 to set the request encoding ?
Should the optional request.setCharacterEncoding become mandatory ?
Should the request.setCharacterEncoding call be made just before the request.getParameter, 
or is there another earlier method call in the list above that can trigger the encoding to 
be already set ?





Re: Character set issue

Posted by Marvin Addison <ma...@gmail.com>.
> /can/ the servlet (or one of the filters)
> do anything that would cause the value of "name1" to /not/ be a correct Java
> "TÜV" string in the servlet ?

Yes, absolutely.  If this is a posted value and some filter fires that
coerces the encoding (e.g. request.getParameter() in the case of POST)
of the request, all subsequent filters and the servlet will see the
string in the encoding of the first filter.  This is why it's
important to set the encoding as early in the servlet processing
pipeline as possible.

For your particular case it's hard to imagine an encoding in practice
that would make that string appear incorrectly.  Both iso-8859-1 and
utf-8 should handle Ü correctly.

M
