Posted to users@tomcat.apache.org by albrecht andrzejewski <al...@ema.fr> on 2008/10/07 19:05:47 UTC
utf8 encoding broken with tomcat and mod proxy ajp
I use a chat servlet deployed in Tomcat 5.5.
Because I live in France, I need some special characters like é / à / ç.
When testing with my IDE (NetBeans + Tomcat 5.5) I have no problem with
these characters, as I use UTF-8.
When serving the HTML canvas page, I declare UTF-8 encoding in the
doGet() method of the servlet:
response.setContentType("text/html;charset=UTF-8");
Chat messages are sent using HTTP POST and, as I understand the specs,
the browser should use the character encoding of the web page. When the
webapp receives a POST message, it makes it available to all the users,
using JSON. So I use:
response.setContentType("application/json;charset=UTF-8");
when serving the new JSON chat messages (the default encoding of JSON is UTF-8).
As I said, there is no problem when using the vanilla Tomcat
bundled with the IDE. But when deploying and testing on the production
server, there is no way to obtain these UTF-8 characters!
My production server is set up with Apache and mod_proxy_ajp, so I
declare UTF-8 to the AJP connector with this line:
<Connector port="8009" address="127.0.0.1" enableLookups="false"
redirectPort="8443" URIEncoding="UTF-8" protocol="AJP/1.3" />
I also added AddDefaultCharset UTF-8 to my Apache httpd.conf
configuration file.
Here is my mod_proxy virtual host configuration:
<VirtualHost *:80>
ServerName serv1.xxx.tld
ProxyRequests Off
ProxyPreserveHost On
AddDefaultCharset UTF-8
ErrorLog /var/log/apache2/apache.tomcat-serv1.error.log
CustomLog /var/log/apache2/apache.tomcat-serv1.log combined
RedirectMatch ^/$ /bpc1/
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPass /bpc1/ ajp://localhost:8009/bpc1/
ProxyPassReverse /bpc1/ ajp://localhost:8009/bpc1/
</VirtualHost>
I've looked at my HTTP headers.
The POST header has this content type:
Content-Type text/html;charset=UTF-8
The JSON response header has this content type:
Content-Type application/json;charset=UTF-8
But when the letter "é" is sent, a corrupted sequence (such as "Ã©", the
UTF-8 bytes of "é" read as ISO-8859-1) is received instead.
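A symptom of this kind is typically what appears when UTF-8 bytes are decoded as ISO-8859-1 somewhere along the chain. The following is only a minimal Java sketch of that effect, not a diagnosis of this particular setup; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or of the chat servlet.
final class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
    // which is what a receiver does when it assumes the wrong charset.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Renders each char as a hex code so the output does not depend
    // on the console encoding.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sent = "\u00E9"; // "é", written as an escape so the source encoding does not matter
        String received = utf8ReadAsLatin1(sent);
        System.out.println(toHex(sent));     // e9
        System.out.println(toHex(received)); // c3 a9
    }
}
```

One character goes in and two come out (U+00C3 and U+00A9), so seeing two odd characters for every accented letter is a hint that some component is reading UTF-8 as ISO-8859-1.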
So I know this is not a pure Tomcat problem, but any help or clue
would be appreciated; I probably missed something when configuring the
server...
Thanks in advance.
--
Albrecht ANDRZEJEWSKI
Créateur - Incubateur Technologique
***
http://haveacafe.wordpress.com/
----------------------------------------------------
This message was sent by the EMA IMP server.
---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org
Re: Tomcat default encoding character ? Dfile.encoding option mean ?
Posted by André Warnier <aw...@ice-sa.com>.
albrecht andrzejewski wrote:
> I ran across the ML archives, and I found some useful posts.
>
> I've almost solved my problem: I can now display the accents (é è à)
> using request.setCharacterEncoding("UTF-8");
> response.setCharacterEncoding("UTF-8");
>
> It seems that the default charset for Tomcat is ISO-8859-1.
> The J2EE javadoc says:
>
> "If no charset is specified, ISO-8859-1 will be used."
>
> I was pretty sure that Tomcat handles UTF-8 by default, but that is not the
> case... at least for HttpServletResponse objects. Anyway, do you know if
> it's possible to set up a default charset for the whole Tomcat response,
> instead of calling these two methods every time a request reaches the
> servlet?
>
> I tried to define CATALINA_OPTS, but perhaps the file encoding is
> different from the request/response encoding.
> CATALINA_OPTS="-Dfile.encoding=UTF-8"
> export LC_ALL CATALINA_OPTS
>
Take the following with caution, because I do not really know the
underlying reason in Tomcat :
I have found that setting the LC_CTYPE environment variable to a UTF-8
"locale" (or inversely, to an ISO-8859-1 locale) prior to starting Tomcat
influences the way in which *some* servlets read request bodies
and/or encode responses.
You can do this in the startup.sh script, or probably more correctly in
the setenv.sh script, in the Tomcat/bin directory (that is, if your
Tomcat is "the" canonical distribution; if your Tomcat comes from a
pre-packaged version, it may not use these scripts for startup).
Make sure to use a valid and installed locale.
Run
locale -a
then choose from the list an installed locale which fits and has "utf8" in
its name, and add it to the script (for example):
LC_CTYPE="en_US.utf8"; export LC_CTYPE
prior to starting Tomcat.
(in the above, I am assuming Unix/Linux; under Windows it may not be
feasible).
One reason to be careful with this anyway, is that it may have
unexpected consequences on other servlets.
I believe this happens when the servlet itself is not specifying
explicitly the encoding it uses for reading the request body or writing
the response, and the JVM then defaults to the locale setting of the
process that runs it and Tomcat.
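To illustrate the difference between relying on the JVM default and naming the charset explicitly, here is a small Java sketch. It is not Tomcat code; the class and method names are hypothetical, and how the default is chosen depends on the JVM version (on older JVMs it follows the locale or file.encoding, while recent JDKs default to UTF-8).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only contrasts default and explicit decoding.
final class DefaultCharsetDemo {

    // Name of the charset the JVM picked up at startup.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    // Decodes with the JVM default: the result depends on the environment.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    // Decodes as UTF-8 no matter how the JVM was started.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] eAcuteUtf8 = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 bytes of "é"
        System.out.println("JVM default charset: " + defaultCharsetName());
        System.out.println("default decode is correct:  "
                + decodeWithDefault(eAcuteUtf8).equals("\u00E9"));
        System.out.println("explicit decode is correct: "
                + decodeAsUtf8(eAcuteUtf8).equals("\u00E9"));
    }
}
```

The explicit variant gives the same answer on every machine, which is the main argument for setting the encoding in the servlet rather than in the environment.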
In other words, in my opinion your solution above of setting this
explicitly in your servlet is the better one.
Also make sure that all the html pages that you serve contain a tag like
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
If your html pages contain <form> tags, and you would like the browser
to be nice and send you proper UTF-8 encoded form values when posting a
form content, then add the following attributes to them, to try and
convince the browser to do the right thing :
<form .. method="POST" enctype="multipart/form-data"
accept-charset="UTF-8">
And then, if you design and edit your html pages yourself, make sure
that you use an editor that supports UTF-8, and save your pages as such.
And then, verify at the browser level (for example with Firefox and the
LiveHttpHeaders extension), that the browser is actually receiving an
HTTP header like
Content-Type: text/html; charset=UTF-8
with every response from your server.
Paranoia : since you cannot trust the user nor his browser anyway, you
may still want to add in your <form>s a hidden input field, containing a
set value that is a known string in UTF-8 with some accented characters.
Then in your application, you could check if you really received that
string as expected. If not, then something unexpected happened with the
form encoding, and you should reject the data. Something like this:
<input type="hidden" value="ÁlélÜìÄ">
which will have a different "string length" depending on whether it is
encoded as UTF-8 or iso-8859-1 (an "é" is 1 byte in iso-8859-1, but 2
bytes in Unicode/UTF-8).
That is not really paranoia, it's experience.
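The check on the hidden field could be sketched in Java roughly as follows. This is an illustration only: the class and method names are invented, and in a real servlet the received value would come from the request parameter that corresponds to the hidden input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for the "known string" check described above.
final class FormEncodingCheck {

    // The marker string from the hidden field, written with Unicode
    // escapes so that the source file encoding cannot alter it.
    static final String EXPECTED = "\u00C1l\u00E9l\u00DC\u00EC\u00C4";

    // True only if the value arrived exactly as it was sent.
    static boolean looksCorrectlyDecoded(String received) {
        return EXPECTED.equals(received);
    }

    static int latin1ByteLength(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Simulates a server that reads a UTF-8 form value as ISO-8859-1.
    static String misdecoded(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(latin1ByteLength(EXPECTED)); // 7
        System.out.println(utf8ByteLength(EXPECTED));   // 12
        System.out.println(looksCorrectlyDecoded(EXPECTED));             // true
        System.out.println(looksCorrectlyDecoded(misdecoded(EXPECTED))); // false
    }
}
```

The marker has five accented letters and two plain ones, hence 7 bytes in ISO-8859-1 and 12 in UTF-8; a wrongly decoded value no longer equals the expected string, so the data can be rejected.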
That was the practical bit. If you want more general theorising, keep reading.
In general, for historical reasons mostly, the default charset/encoding
for HTML and HTTP is ISO-8859-1 (latin-1).
This is not always clear in all RFCs that contribute to various aspects
of web applications however, so there is a certain amount of confusion.
For example, the RFCs concerning HTML are quite clear (iso-8859-1 by
default), while the RFCs concerning HTTP URIs are more vague or mutually
contradictory.
In any case, it is (unfortunately) not Unicode/UTF-8 everywhere by
default, despite the hopes and beliefs of some web developers.
The fact that the internal Java charset is Unicode, and that its
external charset/encoding is often Unicode/UTF-8 (the actual default
depends on the platform), tends to comfort some Java/Tomcat developers
in the false belief that URLs also by default are UTF-8, while they are
not (as far as I can determine, they are encoding-neutral).
Some people also believe that UTF-8 and iso-8859-1 are identical anyway
for the first 256 Unicode code points, so it doesn't really matter. But
this is also incorrect (only the first 128 codes are encoded identically;
above that, UTF-8 uses two or more bytes per character), and it does
matter for anyone trying to build an application that is not purely
English-speaking, as you have noticed.
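The overlap claim is easy to verify with a few lines of Java. This is just a sketch with invented names: it encodes each of the first 256 code points both ways and counts how many byte sequences come out identical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for the overlap between the two encodings.
final class CharsetOverlapDemo {

    // Counts the code points in [0, 255] whose UTF-8 and ISO-8859-1
    // byte sequences are exactly the same.
    static int countIdenticalEncodings() {
        int identical = 0;
        for (int codePoint = 0; codePoint <= 0xFF; codePoint++) {
            String s = String.valueOf((char) codePoint);
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (Arrays.equals(latin1, utf8)) {
                identical++;
            }
        }
        return identical;
    }

    public static void main(String[] args) {
        System.out.println(countIdenticalEncodings()); // 128
    }
}
```

Only the ASCII range matches; every code point from 128 to 255 takes one byte in ISO-8859-1 and two bytes in UTF-8.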
And finally, there seems to be some confusion between the parameter that
specifies a default encoding for Tomcat's internal processing of URIs
(the Connector's URIEncoding attribute, as in your configuration) and
the request body or response body encoding, which it does not affect.
There is also a parameter, I believe (useBodyEncodingForURI on the
Connector), that specifies something like "use the body encoding
for the URL also".
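To show why the charset used for URI decoding matters independently of the body encoding, here is a small Java sketch. It uses the JDK's URLDecoder as a stand-in; it is not Tomcat's own decoding code, and the class and method names are invented for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; URLDecoder stands in for the container's URI decoding.
final class UriDecodingDemo {

    // Decodes a percent-encoded value with the named charset, turning the
    // checked exception into an unchecked one for easier use.
    static String decode(String percentEncoded, String charsetName) {
        try {
            return URLDecoder.decode(percentEncoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String fromBrowser = "%C3%A9"; // "é" percent-encoded from its UTF-8 bytes
        String asUtf8 = decode(fromBrowser, "UTF-8");
        String asLatin1 = decode(fromBrowser, "ISO-8859-1");
        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```

The same percent-encoded bytes yield one character under UTF-8 and two under ISO-8859-1, so the URI decoding charset has to match what the browser used, whatever the body encoding is.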
Add to this, that users can set up their browser in various ways, that
they may have various keyboards and operating systems, that some
browsers disregard what the server says about documents anyway and think
they are smarter, and you get the situation that exists currently on the
web, where half the time I cannot enter my first name in a web form and
see it returned to me correctly in a response or an email. And I guess
you may not be faring much better with your last name..
None of this makes things any simpler, but...
The good news is that it appears to be improving over time, with correct
UTF-8 support now in all browsers, and a tendency by web developers to
specify UTF-8 explicitly wherever it's needed.
Which is many places, if you really want to get all the odds on your side.
Tomcat default encoding character ? Dfile.encoding option mean ?
Posted by albrecht andrzejewski <al...@ema.fr>.
I ran across the ML archives, and I found some useful posts.
I've almost solved my problem: I can now display the accents (é è à)
using request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
It seems that the default charset for Tomcat is ISO-8859-1.
The J2EE javadoc says:
"If no charset is specified, ISO-8859-1 will be used."
I was pretty sure that Tomcat handles UTF-8 by default, but that is not
the case... at least for HttpServletResponse objects. Anyway, do you
know if it's possible to set up a default charset for the whole Tomcat
response, instead of calling these two methods every time a request
reaches the servlet?
I tried to define CATALINA_OPTS, but perhaps the file encoding is
different from the request/response encoding.
CATALINA_OPTS="-Dfile.encoding=UTF-8"
export LC_ALL CATALINA_OPTS
--
Albrecht ANDRZEJEWSKI
Créateur - Incubateur Technologique
SITE-EERIE - Parc scientifique G. Besse
***
http://haveacafe.wordpress.com/