Posted to users@tomcat.apache.org by Tremal Naik <tr...@gmail.com> on 2007/11/16 14:14:12 UTC

UTF-8 charset encoding

Hello Tomcat users,
I'm developing an application on JBoss 4.0.2, which uses Tomcat 5.5.9
as its web tier.

I'm trying to make Tomcat decode the request body with the correct
encoding. I have the problem with both IE and Firefox. The HTML page
has the meta tag:

<html>
<head>
	<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

All characters are displayed correctly in both browsers, and the page
encoding appears correctly set to UTF-8. The problem arises when I try
to submit "strange" characters such as currency symbols (euro, pound,
yen, ...) in a form text box.

Debugging the Tomcat source, I see the following in
org.apache.catalina.connector.Request: the parseParameters() method
tries to get the correct encoding:

        String enc = getCharacterEncoding();

ContentType.getCharsetFromContentType() then tries to infer the
character encoding from the request's Content-Type header, but since
the header carries no charset it returns null:

    public static String getCharsetFromContentType(String type) {
        if (type == null) {
            return null;
        }
        int semi = type.indexOf(";");
        if (semi == -1) {
            return null;
        }
        // ... (rest of the method not shown)

The problem here is that the Content-Type of the HTTP request coming
from the browser is as follows (as captured by an HTTP tracer):

Content-Type: application/x-www-form-urlencoded

Hence, it doesn't carry the ";charset=utf-8" suffix that I would expect.
The body itself looks correctly encoded in UTF-8 to me.
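
To make the lookup described above concrete, here is a minimal,
self-contained sketch of extracting a charset from a Content-Type value.
It is not Tomcat's implementation: the class and method names
(ContentTypeCharset, charsetOf) are made up for illustration, and the
parsing is deliberately simplistic.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Tomcat: reads the charset parameter of a Content-Type value. */
final class ContentTypeCharset {
    private ContentTypeCharset() {}

    /** Returns the charset named in the header value, or null if none is present. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding quotes, as in charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

With the header shown above, charsetOf("application/x-www-form-urlencoded")
returns null, so the caller has nothing to go on and must fall back to a
default.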

Thus, parseParameters() parses the parameters with its default
character encoding, which is ISO-8859-1:

parameters.setEncoding
                (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
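
The effect of that fallback can be reproduced outside Tomcat. The sketch
below uses only java.net.URLDecoder; the class name FormDecodingDemo and
the sample value are illustrative assumptions, with "%E2%82%AC" being the
UTF-8 percent-encoding of the euro sign.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Illustrative only: decodes one URL-encoded form value with a chosen charset. */
final class FormDecodingDemo {
    private FormDecodingDemo() {}

    static String decode(String urlEncoded, String charset) {
        try {
            return URLDecoder.decode(urlEncoded, charset);
        } catch (UnsupportedEncodingException e) {
            // Surface an unknown charset name as a programming error.
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }
}
```

decode("%E2%82%AC", "UTF-8") gives the single character U+20AC, while
decode("%E2%82%AC", "ISO-8859-1") gives three characters (U+00E2, U+0082,
U+00AC), which is the kind of garbage seen when a UTF-8 body is read as
ISO-8859-1.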

Now, I know this is not a Tomcat problem, but I don't know how to
force the client to send the correct information to the server. Any
ideas how to solve this?
It would also be nice if I could set Tomcat's default encoding to UTF-8
without recompiling the source (I have some constraints on
JBoss/Tomcat: I can't use a modified or different version).

By the way, I already tried to update the file
$CATALINA_HOME/conf/server.xml for UTF-8 support by connectors with
the parameter URIEncoding="UTF-8".



Thanks for your help,

-- 
TREMALNAIK

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: UTF-8 charset encoding

Posted by Tremal Naik <tr...@gmail.com>.
2007/11/16, Mark Thomas <ma...@apache.org>:
>
> Some standard text I wrote a while ago follows. The most useful bit is
> probably the URIEncoding attribute on the connector.

Thanks Mark, I think I read your paper somewhere before I decided to
write this help request. In fact, if you read the last paragraph of my
original post carefully, I said that I had already tried setting the
connector attribute URIEncoding="UTF-8". Unfortunately, this attribute
is read after the first parse has occurred, as I explained there, which
makes the setting useless for request body parsing. I think (I didn't
investigate further) this is due to a Valve I wrote that accesses the
request parameters before Connector.getURIEncoding() is invoked for the
first time. I solved this with another Valve (whose code is copied from
the SetCharacterEncodingFilter) that is invoked first in the chain.
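
Why the order matters can be shown with a toy model. This is a sketch
under stated assumptions, not Tomcat code: ToyRequest, ToyValve and
ValveOrderDemo are hypothetical stand-ins for the real Request and Valve
classes, and the request holds a single URL-encoded value that is parsed
once and then cached.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Toy request (hypothetical): parses its one form value lazily and only once. */
final class ToyRequest {
    private final String rawValue;     // URL-encoded value as sent by the browser
    private String encoding;           // null until something sets it
    private String parsedValue;        // result of the first parse
    private boolean parametersParsed;  // once true, the encoding is no longer consulted

    ToyRequest(String rawValue) {
        this.rawValue = rawValue;
    }

    String getCharacterEncoding() {
        return encoding;
    }

    void setCharacterEncoding(String encoding) {
        this.encoding = encoding;
    }

    String getParameter() {
        if (!parametersParsed) {
            // Fall back to ISO-8859-1 when no encoding is known.
            String enc = (encoding != null) ? encoding : "ISO-8859-1";
            try {
                parsedValue = URLDecoder.decode(rawValue, enc);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e);
            }
            parametersParsed = true;
        }
        return parsedValue;
    }
}

/** Toy processing step (hypothetical), standing in for a Valve. */
interface ToyValve {
    void invoke(ToyRequest request);
}

final class ValveOrderDemo {
    private ValveOrderDemo() {}

    /** Imposes UTF-8 only when the client declared no encoding. */
    static final ToyValve DEFAULT_ENCODING = r -> {
        if (r.getCharacterEncoding() == null) {
            r.setCharacterEncoding("UTF-8");
        }
    };

    /** Stands in for any valve that touches the request parameters. */
    static final ToyValve READS_PARAMETERS = r -> r.getParameter();

    /** Runs the valves in order and returns the value the application would see. */
    static String run(String rawValue, ToyValve... chain) {
        ToyRequest request = new ToyRequest(rawValue);
        for (ToyValve valve : chain) {
            valve.invoke(request);
        }
        return request.getParameter();
    }
}
```

With the encoding step first, run("%E2%82%AC", DEFAULT_ENCODING,
READS_PARAMETERS) yields the euro sign. With the two steps swapped, the
first parse has already happened with ISO-8859-1 and setting the encoding
afterwards changes nothing.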

Thanks,

TN



Re: UTF-8 charset encoding

Posted by Mark Thomas <ma...@apache.org>.
Michael wrote:
> 
> start your JVM with -Dfile.encoding=UTF-8 and try again!

That property is read-only on some JVMs and is not the way to achieve what
the OP is trying to do.

Some standard text I wrote a while ago follows. The most useful bit is
probably the URIEncoding attribute on the connector.

Character encoding summary
==========================

There are a number of situations where there may be a requirement to use
non-US ASCII characters in a URI. These include:
- Parameters in the query string
- Servlet paths

There is a standard for encoding URIs
(http://www.w3.org/International/O-URL-code.html) but this standard is not
consistently followed by clients. This causes a number of problems.

The functionality provided by Tomcat (4 and 5) to handle this less than ideal
situation is described below.

1. The Coyote HTTP/1.1 connector has a useBodyEncodingForURI attribute
which if set to true will use the request body encoding to decode the URI
query parameters.
  - The default value is true for TC4 (breaks spec but gives consistent
behaviour across TC4 versions)
  - The default value is false for TC5 (spec compliant but there may be
migration issues for some apps)
2. The Coyote HTTP/1.1 connector has a URIEncoding attribute which defaults
to ISO-8859-1.
3. The parameters class (o.a.t.u.http.Parameters) has a QueryStringEncoding
field which defaults to the URIEncoding. It must be set before the
parameters are parsed to have an effect.
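
As a rough illustration of how the three settings above interact, the
following sketch picks the charset used for query-string parameters. It
is a simplified model, not Tomcat's code: the class and method names are
hypothetical, and falling back to URIEncoding when the body encoding is
unknown is an assumption of this sketch.

```java
/** Simplified, hypothetical model of the query-string charset choice described above. */
final class QueryCharsetRule {
    /** Default stated in the summary for the URIEncoding attribute. */
    static final String DEFAULT_URI_ENCODING = "ISO-8859-1";

    private QueryCharsetRule() {}

    /**
     * @param uriEncoding           value of the connector's URIEncoding attribute, or null if unset
     * @param useBodyEncodingForURI value of the connector's useBodyEncodingForURI attribute
     * @param bodyEncoding          request body encoding, or null if none is known
     */
    static String queryCharset(String uriEncoding, boolean useBodyEncodingForURI, String bodyEncoding) {
        // Assumption: the body encoding only wins when it is actually known.
        if (useBodyEncodingForURI && bodyEncoding != null) {
            return bodyEncoding;
        }
        return (uriEncoding != null) ? uriEncoding : DEFAULT_URI_ENCODING;
    }
}
```

For example, queryCharset(null, false, "UTF-8") returns "ISO-8859-1": with
the TC5 defaults, a UTF-8 request body encoding does not influence how the
query string is decoded.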

Things to note regarding the servlet API:
1. HttpServletRequest.setCharacterEncoding() normally only applies to the
request body NOT the URI.
2. HttpServletRequest.getPathInfo() is decoded by the web container.
3. HttpServletRequest.getRequestURI() is not decoded by the container.

Other tips:
1. Use POST with forms to return parameters as the parameters are then part of
the request body.




Re: UTF-8 charset encoding

Posted by Michael <sg...@gmx.net>.
start your JVM with -Dfile.encoding=UTF-8 and try again!

bye
-- 
<NO> OOXML - Say NO To Microsoft Office broken standard
http://www.noooxml.org



Re: UTF-8 charset encoding

Posted by Tremal Naik <tr...@gmail.com>.
2007/11/19, Ognjen Blagojevic <og...@etf.bg.ac.yu>:
> I suppose you are using ActionForms. Try to extend ActionForm overriding
> your reset method which will set the character encoding, before the
> parameters are processed. Something like this:

Well, I solved it with a Valve that imposes a default encoding of "UTF-8".
This is the perfect solution for me since I read request parameters in
other valves that come later in the chain, and I want them correctly
decoded. Hence, when Struts takes control of the application, the
request parameters are parsed with the correct encoding, which has
already been set.

Thanks,

TN



Re: UTF-8 charset encoding

Posted by Ognjen Blagojevic <og...@etf.bg.ac.yu>.
Tremal Naik wrote:
> Oh, yes, you're right. I'm using version 1.1, which is probably why I
> don't have that option available. Unfortunately, I'm not allowed to
> upgrade to a newer version...

I suppose you are using ActionForms. Try extending ActionForm and
overriding its reset method to set the character encoding before the
parameters are processed. Something like this:

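import java.io.UnsupportedEncodingException;
import javax.servlet.http.HttpServletRequest;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionMapping;

public class UTF8ActionForm extends ActionForm {
    // Struts calls reset() before it populates the form from the request.
    public void reset(ActionMapping mapping, HttpServletRequest request) {
        super.reset(mapping, request);
        try {
            // Has an effect only if the parameters have not been parsed yet.
            request.setCharacterEncoding("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Keep the original exception as the cause.
            throw new RuntimeException(e);
        }
    }
}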

Then use UTF8ActionForm instead of ActionForm.

Also try asking on the Struts mailing list; there should be more people
there who have solved the same problem.

Regards,
Ognjen



Re: UTF-8 charset encoding

Posted by Tremal Naik <tr...@gmail.com>.
2007/11/19, Ognjen Blagojevic <og...@etf.bg.ac.yu>:
> Which version of Struts are you using? 1.2.7 does support acceptCharset,
> as you can see here:


Oh, yes, you're right. I'm using version 1.1, which is probably why I
don't have that option available. Unfortunately, I'm not allowed to
upgrade to a newer version...


Thanks, again

TN



Re: UTF-8 charset encoding

Posted by Ognjen Blagojevic <og...@etf.bg.ac.yu>.
Tremal Naik wrote:
> 2007/11/16, Ognjen Blagojevic <og...@etf.bg.ac.yu>:
>> Did you try to put acceptCharset="UTF-8" in the form tag?
> 
> Well, I'm using Struts and it looks like the html:form tag doesn't allow
> an acceptCharset attribute. I tried setting the enctype attribute, but
> it had no effect.

Which version of Struts are you using? 1.2.7 does support acceptCharset, 
as you can see here:

http://struts.apache.org/1.2.7/userGuide/struts-html.html#form


Regards,
Ognjen



Re: UTF-8 charset encoding

Posted by Tremal Naik <tr...@gmail.com>.
2007/11/16, Ognjen Blagojevic <og...@etf.bg.ac.yu>:
> Did you try to put acceptCharset="UTF-8" in the form tag?

Well, I'm using Struts and it looks like the html:form tag doesn't allow
an acceptCharset attribute. I tried setting the enctype attribute, but
it had no effect.

Thanks,

-- 
TREMALNAIK



Re: UTF-8 charset encoding

Posted by Ognjen Blagojevic <og...@etf.bg.ac.yu>.
Hi Tremal,

Tremal Naik wrote:
> All characters are displayed correctly in both browsers, and the page
> encoding appears correctly set to UTF-8. The problem arises when I try
> to submit "strange" characters such as currency symbols (euro, pound,
> yen, ...) in a form text box.

Did you try to put acceptCharset="UTF-8" in the form tag?

Regards,
Ognjen



Re: UTF-8 charset encoding

Posted by Tremal Naik <tr...@gmail.com>.
2007/11/16, Tremal Naik <tr...@gmail.com>:
> No, unfortunately the parameters are parsed before any filter is
> invoked. Hence, a flag is set on the request that avoids subsequent

I tried with a valve. It looks fine now.

But it's really annoying having to impose a default character encoding
using a valve. Isn't it possible to tell the browser to send the
encoding, somehow?

Thanks,

-- 
TREMALNAIK



Re: UTF-8 charset encoding

Posted by Tremal Naik <tr...@gmail.com>.
2007/11/16, Mohsen Saboorian <mo...@gmail.com>:
> I don't know if this is the best solution. You can create a filter for
> *.* in your web.xml, with the following piece of code:
> response.setCharacterEncoding("UTF-8");

No, unfortunately the parameters are parsed before any filter is
invoked. Hence, a flag is set on the request that prevents any
subsequent re-evaluation of the parameters:

        parametersParsed = true;

When the filter sets the character encoding, it is too late.

thanks, anyway


-- 
TREMALNAIK



Re: UTF-8 charset encoding

Posted by Mohsen Saboorian <mo...@gmail.com>.
Hi,

I don't know if this is the best solution. You can create a filter for
*.* in your web.xml, with the following piece of code:
response.setCharacterEncoding("UTF-8");

Mohsen.
