Posted to users@tomcat.apache.org by Raghuveer <ra...@infotechsw.com> on 2008/06/24 13:53:39 UTC

Is the encoding attribute used only for requests and responses, or for
any other purpose?

To handle HTTP requests and responses in UTF-8, I have added the
following line to the web.xml of my JSP application:

<?xml version="1.0" encoding="utf8"

Is this the correct procedure?


RE: Posted by Raghuveer <ra...@infotechsw.com>.
Hi Andre,

Thanks for the hint.
I have read your solutions in other threads.
I will implement your filter-based solution in my project.

Regards,
Raghu


-----Original Message-----
From: André Warnier [mailto:aw@ice-sa.com] 
Sent: Tuesday, June 24, 2008 5:57 PM
To: Tomcat Users List
Subject: Re: <?xml version="1.0" encoding="ISO-8859 in web.xml


Raghuveer wrote:
> Is the encoding attribute used only for requests and responses, or for
> any other purpose?
> To handle HTTP requests and responses in UTF-8, I have added the
> following line to the web.xml of my JSP application:
> 
> <?xml version="1.0" encoding="utf8"
> 
> Is this the correct procedure?
> 
In short, no.
It will not hurt, but it has nothing to do with the handling of requests 
and responses.
The "encoding" attribute of the XML declaration (<?xml ... ?>) at the 
top of the various Tomcat configuration files just tells the module 
that parses these files which character set the file itself is written 
in.  And since UTF-8 is the default encoding for XML files, what you 
did above basically changes nothing at all.

Now, to answer your real question about UTF-8 request/response handling 
: that is really a very wide question that you are asking, and you 
should probably take this a little bit at a time.

The good news is that Tomcat and Java basically default to 
Unicode/UTF-8, so unless you do things really wrong, it should not be a 
big problem to support UTF-8 requests and responses.

The previous messages in this forum entitled "UTF-8 handling 
differs between two servlets within the same application" will already 
provide you with some good pointers.

André

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org




Re: Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

André Warnier wrote:
| It has been recently shown in a thread in this same forum
| that one does not normally need a filter, and I would submit that using
| a filter as indicated will corrupt data in some instances.

I disagree. The filter is required for clients which silently submit
UTF-8. In that case, the server defaults to ISO-8859-1 and your data is
corrupted.

Your demand that nobody should use GET parameters is unreasonable. Ergo,
the filter is required.

| In the article at
|  > http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
| there is also a problem in the form shown under the title

Then fix it. You are a member of this community, too, now ;)

| How can I test if my configuration will work correctly?

Create a URL that contains encoded UTF-8 characters that would be
displayed differently if interpreted as ISO-8859-1. This is not
difficult to do.

For instance, http://my.server.com/something?psi=ψ
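
To see why that URL makes a good test, here is a minimal sketch using only the JDK. The class name PsiQueryDemo and its helpers are made up for illustration and are not part of Tomcat. It percent-encodes ψ the way a UTF-8 client would, then decodes the result under both charsets.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: shows how one query value decodes under two charsets.
final class PsiQueryDemo {

    /** Percent-encodes a value the way a client using UTF-8 would. */
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Decodes a percent-encoded value using the named charset. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encodeUtf8("\u03C8");        // Greek small letter psi
        System.out.println(encoded);                  // %CF%88
        // Decoded as UTF-8, the two bytes form one character again.
        System.out.println(decode(encoded, "UTF-8").length());       // 1
        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(decode(encoded, "ISO-8859-1").length());  // 2
    }
}
```

If the parameter arrives in the application as two characters instead of one, the server decoded the URL as ISO-8859-1.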

| As demonstrated by a recent thread here also, the <form> tag as shown,
| is missing a
| enctype="multipart/form-data"
| attribute.

The default form enctype is application/x-www-form-urlencoded, which is
fine when no <input type="file" /> form elements are present (see
http://www.w3.org/TR/html401/interact/forms.html#h-17.3).

| This will cause Tomcat to misinterpret the form data in some cases.
| One could also argue that adding an attribute
| accept-charset="UTF-8"
| would make it even more failsafe.

Fair enough. I'm not sure this affects GET requests generated by forms,
though. Also, not all parameters in GET URLs are from forms: some are
normal links (and are often problematic).

| In addition, the article also repeats a mistake often seen, which is to
| tell people that it's ok to send form data via a GET and use non
| US-ASCII data.  This is a recipe for problems, see the first mentioned
| article at java.sun.com.

The only reason it's a recipe for problems is because clients are
inconsistent about their use of character encoding in URLs. Non-ASCII
characters are fine as long as the client and server agree on the
encoding (which is sometimes problematic). Don't confuse the issue of
non-ascii characters in URLs (which is fine) with the inability of
clients and servers to communicate the character encoding properly
(which is not fine).

| Now, I know that these are Wiki articles and can be corrected by anyone,
| but isn't that a problem ? For better or worse, these articles are used
| as reference by Tomcat users.  See your own response above.
| If someone goes ahead and posts incorrect technical stuff there, there
| is a problem, no ?

Yes. If something is wrong, it should be fixed. We might only find out
that it's broken because someone reads it and finds a problem. Given
your passion for the Truth-with-a-capital-T, please correct the article.
Someone in the future may re-correct it if your version of the truth
turns out to be ... lacking.

| I mean that I, as a mere user, don't feel at ease going ahead and
| modifying the Wiki article of someone else unilaterally, nor of posting
| another one saying the previous one is all wrong.  But maybe there
| should be some form of authoritative control of the accuracy of what is
| posted there ?

The Wiki is a wiki so that the documentation can grow organically,
rather than having to wait until some blessed Tomcat developer gets
around to fixing the documentation. The power has been placed into your
hands for a reason. Wikis keep versions, so if you replace everything
with porn, it'll just get reverted and you'll get booted off. Given that
you will likely be making a positive contribution, I'm sure your changes
will stick.

You have to abandon the "us versus them" mentality that you have about
you and the rest of the community. Most of the active users on this list
are not Tomcat developers. There is no "them". There is only "us".

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhieJoACgkQ9CaO5/Lv0PAsdACgqgKUeQEB+5y+hGWePFNEfpfk
l/AAoKEItRcDZfU1BQmPss8qZ5OXc/Hu
=cy91
-----END PGP SIGNATURE-----



RE: Posted by Raghuveer <ra...@infotechsw.com>.
Hi Andre,
I have implemented it and it is working perfectly.
Thanks for the very valuable information.

Regards,
Raghu

-----Original Message-----
From: André Warnier [mailto:aw@ice-sa.com] 
Sent: Wednesday, June 25, 2008 2:52 PM
To: Tomcat Users List
Subject: Re: <?xml version="1.0" encoding="ISO-8859 in web.xml


Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> 
> André Warnier wrote:
> | What else does need to be done at the Tomcat configuration level so that
> | it would handle UTF-8 requests properly, and produce UTF-8 responses
> | properly ?
> 
> <sigh> I hate responding with the same old stuff, but these sources of
> information really do cover everything we are perseverating over:
> 
> http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
> http://wiki.apache.org/tomcat/Tomcat/UTF-8
> Also:
> http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
> 
The last reference (which I did not know) is excellent.  Thank you.

But the other two references, if you are perseverating over them, are in 
my view not good references worth perseverating over.

The article at
 > http://wiki.apache.org/tomcat/Tomcat/UTF-8
is incorrect.  The second part (Alternative) has been recently corrected 
for the better, but the very premise of the article is wrong and 
misleading.  It has been recently shown in a thread in this same forum 
that one does not normally need a filter, and I would submit that using 
a filter as indicated will corrupt data in some instances.

In the article at
 > http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
there is also a problem in the form shown under the title

How can I test if my configuration will work correctly?

As demonstrated by a recent thread here also, the <form> tag as shown, 
is missing a
enctype="multipart/form-data"
attribute.
This will cause Tomcat to misinterpret the form data in some cases.
One could also argue that adding an attribute
accept-charset="UTF-8"
would make it even more failsafe.

In addition, the article also repeats a mistake often seen, which is to 
tell people that it's ok to send form data via a GET and use non 
US-ASCII data.  This is a recipe for problems, see the first mentioned 
article at java.sun.com.

That article explains the basic reason why it is a problem : although 
there exist (more or less) rules as to how to encode non-ASCII data in 
URLs, the problem is that when it receives such a request, the server 
has basically no idea how the URL was actually encoded, so it can only 
guess at how to decode it properly.

This is also explicitly discouraged in the HTML 4.01 specification at
(http://www.w3.org/TR/html401/interact/forms.html#submit-format
17.13.4 Form content types )

Now, I know that these are Wiki articles and can be corrected by anyone, 
but isn't that a problem ? For better or worse, these articles are used 
as reference by Tomcat users.  See your own response above.
If someone goes ahead and posts incorrect technical stuff there, there 
is a problem, no ?
I mean that I, as a mere user, don't feel at ease going ahead and 
modifying the Wiki article of someone else unilaterally, nor of posting 
another one saying the previous one is all wrong.  But maybe there 
should be some form of authoritative control of the accuracy of what is 
posted there ?

André




Re: Posted by Mark Thomas <ma...@apache.org>.
André Warnier wrote:
> The last reference (which I did not know) is excellent.  Thank you.
> 
> But the other two references, if you are perseverating over them, are in 
> my view not good references worth perseverating over.
> 
> The article at
>  > http://wiki.apache.org/tomcat/Tomcat/UTF-8
> is incorrect.  The second part (Alternative) has been recently corrected 
> for the better, but the very premise of the article is wrong and 
> misleading.  It has been recently shown in a thread in this same forum 
> that one does not normally need a filter, and I would submit that using 
> a filter as indicated will corrupt data in some instances.

The first solution is horrible as a standard approach but is a useful 
example of how you might recover mangled UTF-8 text.
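
A minimal sketch of that recovery step, assuming the damage came from UTF-8 bytes being decoded as ISO-8859-1. The class MojibakeRepair is hypothetical and is not the code from the Wiki article.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: undoes one specific kind of mis-decoding.
final class MojibakeRepair {

    /** Simulates a server that read UTF-8 bytes as if they were ISO-8859-1. */
    static String mangle(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the text by getting the raw bytes back and decoding them as UTF-8. */
    static String repair(String mangled) {
        byte[] raw = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String psi = "\u03C8";
        // Round trip: mangled text is restored.
        System.out.println(repair(mangle(psi)).equals(psi));
        // Applied to text that was decoded correctly, the same step does damage.
        String eAcute = "\u00E9";
        System.out.println(repair(eAcute).equals(eAcute));
    }
}
```

The second call shows why this is a poor standard approach: repairing a string that was never mangled turns a correct "é" into something else, because the lone byte 0xE9 is not valid UTF-8.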

You could also add that a filter should be unnecessary but that many 
developers prefer it as it 'fixes' all pages with a few lines of code 
rather than having to fix every single page.

> In the article at
>  > http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
> there is also a problem in the form shown under the title
> 
> How can I test if my configuration will work correctly?
> 
> As demonstrated by a recent thread here also, the <form> tag as shown, 
> is missing a
> enctype="multipart/form-data"
> attribute.

Agreed. Feel free to fix it.

> This will cause Tomcat to misinterpret the form data in some cases.
> One could also argue that adding an attribute
> accept-charset="UTF-8"
> would make it even more failsafe.

Wouldn't do any harm.

> In addition, the article also repeats a mistake often seen, which is to 
> tell people that it's ok to send form data via a GET and use non 
> US-ASCII data.  This is a recipe for problems, see the first mentioned 
> article at java.sun.com.

There you get into a grey area in the various specs. Probably the best 
solution is a comment that says POST is easier to control than GET but if 
you are stuck with GET for whatever reason then...

> Now, I know that these are Wiki articles and can be corrected by anyone, 
> but isn't that a problem ? For better or worse, these articles are used 
> as reference by Tomcat users.  See your own response above.
> If someone goes ahead and posts incorrect technical stuff there, there 
> is a problem, no ?
> I mean that I, as a mere user, don't feel at ease going ahead and 
> modifying the Wiki article of someone else unilaterally, nor of posting 
> another one saying the previous one is all wrong.  But maybe there 
> should be some form of authoritative control of the accuracy of what is 
> posted there ?

This is a community of which we are all members. We are all equally 
responsible for keeping the Wiki relevant and accurate. Any and every 
member of this community has an equal right to go and edit any Wiki article.

There was a long time when Wiki changes were not posted to the dev list so 
no review took place. That has been fixed and all changes now get copied to 
the dev list where they will be reviewed. Gross inaccuracies are likely to 
be corrected quickly.

And yes, I could have made all the changes above myself rather than write 
this but I really would like to see a few more people on this list take the 
plunge and start updating the Wiki, particularly the FAQ.

Mark




Re: Posted by André Warnier <aw...@ice-sa.com>.
Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> 
> André Warnier wrote:
> | What else does need to be done at the Tomcat configuration level so that
> | it would handle UTF-8 requests properly, and produce UTF-8 responses
> | properly ?
> 
> <sigh> I hate responding with the same old stuff, but these sources of
> information really do cover everything we are perseverating over:
> 
> http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
> http://wiki.apache.org/tomcat/Tomcat/UTF-8
> Also:
> http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/
> 
The last reference (which I did not know) is excellent.  Thank you.

But the other two references, if you are perseverating over them, are in 
my view not good references worth perseverating over.

The article at
 > http://wiki.apache.org/tomcat/Tomcat/UTF-8
is incorrect.  The second part (Alternative) has been recently corrected 
for the better, but the very premise of the article is wrong and 
misleading.  It has been recently shown in a thread in this same forum 
that one does not normally need a filter, and I would submit that using 
a filter as indicated will corrupt data in some instances.

In the article at
 > http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
there is also a problem in the form shown under the title

How can I test if my configuration will work correctly?

As demonstrated by a recent thread here also, the <form> tag as shown, 
is missing a
enctype="multipart/form-data"
attribute.
This will cause Tomcat to misinterpret the form data in some cases.
One could also argue that adding an attribute
accept-charset="UTF-8"
would make it even more failsafe.

In addition, the article also repeats a mistake often seen, which is to 
tell people that it's ok to send form data via a GET and use non 
US-ASCII data.  This is a recipe for problems, see the first mentioned 
article at java.sun.com.

That article explains the basic reason why it is a problem : although 
there exist (more or less) rules as to how to encode non-ASCII data in 
URLs, the problem is that when it receives such a request, the server 
has basically no idea how the URL was actually encoded, so it can only 
guess at how to decode it properly.
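
The kind of guessing described above can be sketched as follows. This is one possible heuristic under my own assumptions (try strict UTF-8 first, fall back to ISO-8859-1), not what Tomcat actually does; the class name CharsetGuess is invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic: the bytes carry no charset label, so we can only guess.
final class CharsetGuess {

    /** Decodes raw bytes as UTF-8 if they are valid UTF-8, else as ISO-8859-1. */
    static String decodeGuessing(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: ISO-8859-1 accepts any byte sequence.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Psi = {(byte) 0xCF, (byte) 0x88};   // psi encoded as UTF-8
        byte[] latin1E = {(byte) 0xE9};                // e-acute encoded as ISO-8859-1
        System.out.println(decodeGuessing(utf8Psi).equals("\u03C8"));
        System.out.println(decodeGuessing(latin1E).equals("\u00E9"));
    }
}
```

The guess can still be wrong: the bytes 0xC3 0xA9 are valid UTF-8 for "é" but are also a legitimate two-character ISO-8859-1 string, and nothing in the request says which one the client meant.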

This is also explicitly discouraged in the HTML 4.01 specification at
(http://www.w3.org/TR/html401/interact/forms.html#submit-format
17.13.4 Form content types )

Now, I know that these are Wiki articles and can be corrected by anyone, 
but isn't that a problem ? For better or worse, these articles are used 
as reference by Tomcat users.  See your own response above.
If someone goes ahead and posts incorrect technical stuff there, there 
is a problem, no ?
I mean that I, as a mere user, don't feel at ease going ahead and 
modifying the Wiki article of someone else unilaterally, nor of posting 
another one saying the previous one is all wrong.  But maybe there 
should be some form of authoritative control of the accuracy of what is 
posted there ?

André




Re: Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



André Warnier wrote:
| What else does need to be done at the Tomcat configuration level so that
| it would handle UTF-8 requests properly, and produce UTF-8 responses
| properly ?

<sigh> I hate responding with the same old stuff, but these sources of
information really do cover everything we are perseverating over:

http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
http://wiki.apache.org/tomcat/Tomcat/UTF-8
Also:
http://java.sun.com/developer/technicalArticles/Intl/HTTPCharset/

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhhOJIACgkQ9CaO5/Lv0PCZ4wCggcyro2J9ZkZHb0WqYoajH1JR
eV8An2RpJeWMaUNuFh9fU/SLqYqsU6KS
=gIDB
-----END PGP SIGNATURE-----



Re: Posted by André Warnier <aw...@ice-sa.com>.

Christopher Schultz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> André,
> 
> André Warnier wrote:
> | The good news is that Tomcat and Java basically default to
> | Unicode/UTF-8, so unless you do things really wrong, it should not be a
> | big problem to support UTF-8 requests and responses.
> 
> No, you need to configure Tomcat to use UTF8 as the default. ISO-8859-1
> will be used as per the HTTP spec if no other factors are involved.
> 
Ooops.
What I meant was that, Tomcat and its running apps being Java-based and 
Java's internal charset being Unicode, stuff like request.getParameter() 
will by default deliver Unicode, without any need to jump through hoops.

But of course that is IF the pages sent to the browser are properly 
UTF-8 and marked as UTF-8, if the <form> tags in them are properly set 
up, if the browser does what it should do and so on.

But maybe that is also still not correct, or not enough.
What else does need to be done at the Tomcat configuration level so that 
it would handle UTF-8 requests properly, and produce UTF-8 responses 
properly ?

André



Re: Posted by Christopher Schultz <ch...@christopherschultz.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

André,

André Warnier wrote:
| The good news is that Tomcat and Java basically default to
| Unicode/UTF-8, so unless you do things really wrong, it should not be a
| big problem to support UTF-8 requests and responses.

No, you need to configure Tomcat to use UTF8 as the default. ISO-8859-1
will be used as per the HTTP spec if no other factors are involved.
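
As a rough model of that rule, the sketch below picks the charset named in a Content-Type header and falls back to ISO-8859-1 when none is given. It is a simplified illustration only, not Tomcat's implementation; the class ContentTypeCharset and its method are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical model of "use the declared charset, else ISO-8859-1".
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or ISO-8859-1 if absent or unknown. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported name: fall through to the default.
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```

Browsers usually send form posts without a charset parameter, which is the "silent UTF-8" case where the fallback applies and an application has to override it, for example by calling request.setCharacterEncoding("UTF-8") before reading any parameter.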

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhg+jQACgkQ9CaO5/Lv0PCeegCgxBTjldbIW7KEHqX9rFqBK6kI
fb4AoKpFuPM5+4JPtuG7boF7ge6CDamZ
=Pdda
-----END PGP SIGNATURE-----



Re: Posted by André Warnier <aw...@ice-sa.com>.
Raghuveer wrote:
> Is the encoding attribute used only for requests and responses, or for
> any other purpose?
> To handle HTTP requests and responses in UTF-8, I have added the
> following line to the web.xml of my JSP application:
> 
> <?xml version="1.0" encoding="utf8"
> 
> Is this the correct procedure?
> 
In short, no.
It will not hurt, but it has nothing to do with the handling of requests 
and responses.
The "encoding" attribute of the XML declaration (<?xml ... ?>) at the 
top of the various Tomcat configuration files just tells the module 
that parses these files which character set the file itself is written 
in.  And since UTF-8 is the default encoding for XML files, what you 
did above basically changes nothing at all.
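
A small sketch of what the declaration does, using the JDK's own XML parser rather than Tomcat's configuration loader. The class XmlDeclarationDemo and the sample <display-name> content are assumptions made for the example. The declared encoding only tells the parser how to turn the bytes of the file into characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo: the XML declaration describes the file's own bytes, nothing else.
final class XmlDeclarationDemo {

    /** Builds a tiny XML document whose bytes are written in the given charset. */
    static byte[] sample(String declaredEncoding, Charset actualEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<display-name>caf\u00E9</display-name>";
        return xml.getBytes(actualEncoding);
    }

    /** Parses the bytes and returns the text of the root element. */
    static String parseRootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same text, two different byte encodings, each declared correctly.
        byte[] asUtf8 = sample("UTF-8", StandardCharsets.UTF_8);
        byte[] asLatin1 = sample("ISO-8859-1", StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.length != asLatin1.length);               // bytes differ
        System.out.println(parseRootText(asUtf8).equals(parseRootText(asLatin1)));
    }
}
```

Both files yield the same characters once parsed, and nothing in this step touches how HTTP requests or responses are encoded.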

Now, to answer your real question about UTF-8 request/response handling 
: that is really a very wide question that you are asking, and you 
should probably take this a little bit at a time.

The good news is that Tomcat and Java basically default to 
Unicode/UTF-8, so unless you do things really wrong, it should not be a 
big problem to support UTF-8 requests and responses.

The previous messages in this forum entitled "UTF-8 handling 
differs between two servlets within the same application" will already 
provide you with some good pointers.

André
