Posted to users@tomcat.apache.org by albrecht andrzejewski <al...@ema.fr> on 2008/10/07 19:05:47 UTC

utf8 encoding broken with tomcat and mod proxy ajp

I use a chat servlet deployed in Tomcat 5.5.

Because I live in France, I need some special characters like é / à / ç.
When testing with my IDE (NetBeans + Tomcat 5.5) I have no problem with
these characters, as I use UTF-8.

When serving the HTML canvas page, I declare UTF-8 encoding in the
doGet() method of the servlet:

  response.setContentType("text/html;charset=UTF-8");

Chat messages are sent using HTTP POST, and as far as I understand the
browser should submit them in the character encoding of the web page.
When the webapp receives a POST message, it makes it available to all
the users as JSON. So I use:

  response.setContentType("application/json;charset=UTF-8");

when serving the new JSON chat messages (JSON is normally exchanged as UTF-8).

As I said, there is no problem when using the vanilla Tomcat
bundled with the IDE. But when deploying and testing on the production
server, there is no way to get these UTF-8 characters through!


My production server is set up with Apache and mod_proxy_ajp, so I
declare UTF-8 on the AJP connector with this line:

<Connector port="8009" address="127.0.0.1" enableLookups="false"  
redirectPort="8443" URIEncoding="UTF-8" protocol="AJP/1.3" />

I obviously added AddDefaultCharset UTF-8 to my Apache httpd.conf
configuration file.

Here is my mod_proxy virtual host configuration:

<VirtualHost *:80>
   ServerName serv1.xxx.tld
   ProxyRequests Off
   ProxyPreserveHost On
   AddDefaultCharset UTF-8
   ErrorLog /var/log/apache2/apache.tomcat-serv1.error.log
   CustomLog /var/log/apache2/apache.tomcat-serv1.log combined
   RedirectMatch ^/$ /bpc1/
   <Proxy *>
      Order deny,allow
      Allow from all
   </Proxy>
   ProxyPass /bpc1/ ajp://localhost:8009/bpc1/
   ProxyPassReverse /bpc1/ ajp://localhost:8009/bpc1/
</VirtualHost>


I've looked at my HTTP headers.
The POST header has this content type:
Content-Type: text/html;charset=UTF-8
The JSON response header has this content type:
Content-Type: application/json;charset=UTF-8

But when the letter "é" is sent, "é" is received.
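
For illustration only (this sketch is not part of the original setup):
if the garbled output shows up as two characters such as "Ã©" in place
of "é", that is the usual sign of UTF-8 bytes being decoded as
ISO-8859-1 somewhere along the way. The class name MojibakeDemo and its
two methods are made up for this example, and it only reproduces the
effect in plain Java, without Tomcat or Apache.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8, then wrongly decode the same bytes as ISO-8859-1.
    public static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the original bytes and decode them as UTF-8.
    public static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "\u00e9"; // the letter é, written as an escape
        String received = misdecode(sent);
        System.out.println(received.length());             // 2
        System.out.println(repair(received).equals(sent)); // true
    }
}
```

The one character sent becomes two characters on arrival, which matches
what a component decoding with the wrong charset would produce.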


I know this is not purely a Tomcat problem, but any help or clue
would be appreciated; I probably missed something when configuring the
server...

Thanks in advance.


-- 
Albrecht ANDRZEJEWSKI
Créateur - Incubateur Technologique
***
http://haveacafe.wordpress.com/

----------------------------------------------------
This message was sent by the IMP server of EMA.



---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Tomcat default encoding character ? Dfile.encoding option mean ?

Posted by André Warnier <aw...@ice-sa.com>.
albrecht andrzejewski wrote:
> I searched the ML archives and found some useful posts.
> 
> I've almost solved my problem: I can now display the accents (é è à) 
> using   request.setCharacterEncoding("UTF-8");
> response.setCharacterEncoding("UTF-8");
> 
> It seems that the default charset for Tomcat is ISO-8859-1.
> The J2EE javadoc says:
> 
> "If no charset is specified, ISO-8859-1 will be used."
> 
> I was pretty sure that Tomcat handled UTF-8 by default, but that's not 
> the case... at least for HttpServletResponse objects. Anyway, do you 
> know if it's possible to set up a default charset for all Tomcat 
> responses, instead of calling these two methods every time a request 
> reaches the servlet?
> 
> I tried to define the CATALINA_OPTS, but perhaps the file encoding is 
> different from the request/response encoding.
> CATALINA_OPTS="-Dfile.encoding=UTF-8"
> export LC_ALL CATALINA_OPTS
> 

Take the following with caution, because I do not really know the 
underlying reason in Tomcat :

I have found that setting the LC_CTYPE environment variable to a UTF-8 
"locale" (or conversely, to an ISO-8859-1 locale) prior to starting Tomcat 
influences the way in which *some* servlets read request bodies 
and/or encode responses.
You can do this in the startup.sh script, or probably more correctly in 
the setenv.sh script, in the Tomcat/bin directory (that is, if your 
Tomcat is "the" canonical distribution; if your Tomcat comes from a 
pre-packaged version, it may not use these scripts for startup).
Make sure to use a valid and installed locale. Run
locale -a
then choose from the list an installed locale that fits and has "utf8" 
in its name, and add it to the script, for example:
LC_CTYPE="en_US.utf8"; export LC_CTYPE
prior to starting Tomcat.

(in the above, I am assuming Unix/Linux; under Windows it may not be 
feasible).

One reason to be careful with this anyway, is that it may have 
unexpected consequences on other servlets.
I believe this happens when the servlet itself is not specifying 
explicitly the encoding it uses for reading the request body or writing 
the response, and the JVM then defaults to the locale setting of the 
process that runs it and Tomcat.

In other words, in my opinion your solution above of setting this 
explicitly in your servlet is the better one.
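
To make the difference concrete, here is a minimal sketch in plain Java
(no servlet API). It assumes the request body is simply a byte array;
the class and method names are invented for the example. It only shows
the JVM-level default. The servlet container applies its own default to
request bodies, which is a separate matter. Note also that, as far as I
know, recent JDKs (18 and later) use UTF-8 as the default charset
regardless of locale, whereas older ones follow the locale or
-Dfile.encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Result depends on the JVM default charset (locale or -Dfile.encoding).
    public static String decodeWithDefault(byte[] body) {
        return new String(body, Charset.defaultCharset());
    }

    // Result is the same on every machine.
    public static String decodeAsUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xC3, (byte) 0xA9}; // "é" encoded as UTF-8
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("default decoding ok:  "
                + decodeWithDefault(body).equals("\u00e9"));
        System.out.println("explicit decoding ok: "
                + decodeAsUtf8(body).equals("\u00e9"));
    }
}
```

Running it once with a UTF-8 locale and once with an ISO-8859-1 locale
on an older JVM should show the first check changing while the second
stays true.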

Also make sure that all the html pages that you serve contain a tag like
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

If your html pages contain <form> tags, and you would like the browser 
to be nice and send you proper UTF-8 encoded form values when posting a 
form content, then add the following attributes to them, to try and 
convince the browser to do the right thing :
<form .. method="POST" enctype="multipart/form-data" 
accept-charset="UTF-8">

And then, if you design and edit your html pages yourself, make sure 
that you use an editor that supports UTF-8, and save your pages as such.

And then, verify at the browser level (for example with Firefox and the 
LiveHttpHeaders extension), that the browser is actually receiving an 
HTTP header like
Content-Type: text/html; charset=UTF-8
with every response from your server.

Paranoia : since you cannot trust the user nor his browser anyway, you 
may still want to add in your <form>s a hidden input field, containing a 
set value that is a known string in UTF-8 with some accented characters. 
  Then in your application, you could check if you really received that 
string as expected.  If not, then something unexpected happened with the 
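form encoding, and you should reject the data. Something like this 
(the field name "charset_check" is only an example; a hidden field 
needs a name to be submitted at all):
<input type="hidden" name="charset_check" value="ÁlélÜìÄ">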
which will have a different "string length" depending on whether it is 
encoded as UTF-8 or iso-8859-1 (an "é" is 1 byte in iso-8859-1, but 2 
bytes in Unicode/UTF-8).
That is not really paranoia, it's experience.
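
A rough sketch of that check in plain Java, assuming the application
already has the received field value as a String. The class
SentinelCheck and the method arrivedIntact are hypothetical names, and
the expected string is written with Unicode escapes so that the source
file encoding cannot interfere.

```java
import java.nio.charset.StandardCharsets;

public class SentinelCheck {
    // The known hidden-field value: Á l é l Ü ì Ä
    public static final String EXPECTED =
            "\u00c1l\u00e9l\u00dc\u00ec\u00c4";

    // True only if the value survived the round trip unchanged.
    public static boolean arrivedIntact(String received) {
        return EXPECTED.equals(received);
    }

    public static void main(String[] args) {
        // Byte length differs between the two encodings.
        System.out.println(
                EXPECTED.getBytes(StandardCharsets.ISO_8859_1).length); // 7
        System.out.println(
                EXPECTED.getBytes(StandardCharsets.UTF_8).length);      // 12

        // Simulate a UTF-8 submission decoded as ISO-8859-1 on the server.
        String garbled = new String(
                EXPECTED.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());       // 12
        System.out.println(arrivedIntact(garbled)); // false
        System.out.println(arrivedIntact(EXPECTED)); // true
    }
}
```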

That was the practical bit. If you want more general theorising, keep reading.

In general, for historical reasons mostly, the default charset/encoding 
for HTML and HTTP is ISO-8859-1 (latin-1).
This is not always clear in all RFCs that contribute to various aspects 
of web applications however, so there is a certain amount of confusion. 
  For example, the RFCs concerning HTML are quite clear (iso-8859-1 by 
default), while the RFCs concerning HTTP URIs are more vague or mutually 
contradictory.
In any case, it is (unfortunately) not Unicode/UTF-8 everywhere by 
default, despite the hopes and beliefs of some web developers.

The fact that the internal Java charset is Unicode, and that its default 
external encoding is often UTF-8 (depending on the platform locale), 
tends to reinforce some Java/Tomcat developers in the false belief that 
URLs are also UTF-8 by default, while they are not (as far as I can 
determine, they are encoding-neutral).
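
A small sketch of what "encoding-neutral" means in practice: the same
character percent-encodes differently depending on the charset chosen,
and nothing in the URL itself says which one was used. The class
PercentEncodingDemo and its helper names are invented for this example;
it uses the Charset overloads of URLEncoder and URLDecoder, which
assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncodingDemo {
    public static String encodeUtf8(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static String encodeLatin1(String s) {
        return URLEncoder.encode(s, StandardCharsets.ISO_8859_1);
    }

    // Decode a percent-encoded value assuming ISO-8859-1 bytes.
    public static String decodeLatin1(String s) {
        return URLDecoder.decode(s, StandardCharsets.ISO_8859_1);
    }

    // Decode a percent-encoded value assuming UTF-8 bytes.
    public static String decodeUtf8(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // é
        System.out.println(encodeUtf8(e));   // %C3%A9
        System.out.println(encodeLatin1(e)); // %E9

        // The receiver has to guess; a wrong guess yields two characters.
        System.out.println(decodeLatin1("%C3%A9").length()); // 2
        System.out.println(decodeUtf8("%C3%A9").equals(e));  // true
    }
}
```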

Some people also believe that UTF-8 and iso-8859-1 are identical anyway 
for the first 256 Unicode code points, so it doesn't really matter.  But 
this is also incorrect: only the first 128 codes are encoded identically, 
while the characters from 128 to 255 take two bytes each in UTF-8. It 
does matter for anyone trying to build an application that is not purely 
English-speaking, as you have noticed.
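
This is easy to verify with a few lines of Java. The sketch below (the
class name OverlapCount is made up) encodes each of the first 256 code
points both ways and counts how many produce identical bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OverlapCount {
    // Count code points 0..255 whose ISO-8859-1 and UTF-8 bytes match.
    public static int identicalEncodings() {
        int same = 0;
        for (int codePoint = 0; codePoint < 256; codePoint++) {
            String s = String.valueOf((char) codePoint);
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (Arrays.equals(latin1, utf8)) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        System.out.println(identicalEncodings()); // 128
    }
}
```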

And finally, there seems to be some confusion between the connector 
parameter that specifies the encoding Tomcat uses to decode URIs 
(URIEncoding, which you set on your AJP connector) and the encoding of 
the request body or response body, which that parameter does not affect.  
There is also a connector parameter, useBodyEncodingForURI I believe, 
that tells Tomcat to use the body encoding for the URI as well.

Add to this, that users can set up their browser in various ways, that 
they may have various keyboards and operating systems, that some 
browsers disregard what the server says about documents anyway and think 
they are smarter, and you get the situation that exists currently on the 
web, where half the time I cannot enter my first name in a web form and 
see it returned to me correctly in a response or an email.  And I guess 
you may not be faring much better with your last name..

None of this makes things any simpler, but...

The good news is that it appears to be improving over time, with correct 
UTF-8 support now in all browsers, and a tendency by web developers to 
specify UTF-8 explicitly wherever it's needed.
Which is many places, if you really want to stack the odds in your favour.









Tomcat default encoding character ? Dfile.encoding option mean ?

Posted by albrecht andrzejewski <al...@ema.fr>.
I searched the ML archives and found some useful posts.

I've almost solved my problem: I can now display the accents (é è à)
using   request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");

It seems that the default charset for Tomcat is ISO-8859-1.
The J2EE javadoc says:

"If no charset is specified, ISO-8859-1 will be used."

I was pretty sure that Tomcat handled UTF-8 by default, but that's not
the case... at least for HttpServletResponse objects. Anyway, do you
know if it's possible to set up a default charset for all Tomcat
responses, instead of calling these two methods every time a request
reaches the servlet?

I tried to define the CATALINA_OPTS, but perhaps the file encoding is  
different from the request/response encoding.
CATALINA_OPTS="-Dfile.encoding=UTF-8"
export LC_ALL CATALINA_OPTS

-- 
Albrecht ANDRZEJEWSKI
Créateur - Incubateur Technologique
SITE-EERIE - Parc scientifique G. Besse
***
http://haveacafe.wordpress.com/
