Posted to dev@tomcat.apache.org by Alec Yu <al...@msa.hinet.net> on 2001/05/12 07:03:58 UTC

[Proposal] Default Encoding option for JSP/Tomcat in server.xml or web.xml

I read some of the code in Catalina and Jasper and found the following.
There is now a setCharacterEncoding() method for servlet requests, but I grepped all of the
Tomcat code and found no place that calls it. This means that, by default, Tomcat uses an
encoding of '8859_1'. There is no option in server.xml/web.xml that lets a context or
container (or whatever) use a default encoding other than '8859_1'.

Also, the alternative (JSP compilation) encoding option in conf/web.xml for Jasper
does not seem to work (at least, it fails for JSP pages in Big5 encoding).
When there is no '<%@ page contentType="text/html; charset=xxx" %>' in a JSP,
Jasper uses '8859_1' as the JSP's default encoding, oops.

We are working on a product that deploys JSP pages targeting multiple
markets: Japan, Taiwan, and probably mainland China. When we maintain
our JSP pages (which initially show messages in English, but should be able to handle
input in localized character encodings), we do not want to maintain 3 versions of
each page that differ only in the page directive:
'<%@ page contentType="text/html; charset=xxx" %>'


I also found that Tomcat does a byte->char typecast first and then a char->byte typecast
back before converting the bytes into a Java string. Unfortunately, the character
encoding is never changed from '8859_1' to a customized one assigned
somewhere other than in code.

This seems to work at first, as long as you don't treat strings read from GET/POST
parameters as Unicode strings, because they are NOT VALID UNICODE STRINGS.
Web output generated from servlets/JSP pages may look right, simply because the contents
of these NOT VALID UNICODE STRINGS are converted into bytes again by a plain
char->byte typecast.
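
A minimal sketch of the round trip described above, using only the JDK. It assumes the browser sent UTF-8 bytes; the class and method names are hypothetical and this is not Tomcat's code. It shows why echoing a parameter can look right even though the intermediate string is not the text the user typed:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    /** Decodes raw request bytes the way a fixed 8859_1 default would. */
    public static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the Latin-1 decode and re-decodes with the charset actually used. */
    public static String repair(String misdecoded, Charset actual) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";                         // two CJK characters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);  // six bytes on the wire

        String misdecoded = decodeAsLatin1(wire);
        // One char per byte: the string has 6 chars, not the 2 the user typed.
        System.out.println(misdecoded.length());

        // Writing it back out as Latin-1 reproduces the original bytes,
        // which is why simple echo pages appear to work.
        byte[] echoed = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(wire, echoed));

        // Any code that needs the real text must re-decode explicitly.
        System.out.println(repair(misdecoded, StandardCharsets.UTF_8).equals(original));
    }
}
```

The byte values survive because ISO-8859-1 maps every byte to exactly one char, but length checks, substring operations, and database writes on the intermediate string would all see the wrong text.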

Oops! That goes too far. People can't do internationalization/localization with such a
"garbage in, garbage out" solution. It may look right at both the input and output ends
if you simply GET/POST something and call out.println(xxx.getParameter("foo")).
But if you are doing something serious with character encodings other than 8859_1
(if Big5, GB2312 and Shift_JIS are only for localization and not serious enough, how about
the UTF-8 character encoding? Indeed, Tomcat garbled GET/POST inputs in UTF-8 encoding),
you must handle this problem.

Personally, I wrote my own connector to address this problem. The connector works as a
bridge from Sun's Brazil web server (a lightweight web server in 100% Java). Brazil
HTTP request objects are passed directly into the connector (rather than via some socket
protocol), so the connector can configure servlets/JSP pages to use a default encoding
given by properties set in the Brazil configuration file. It also checks the URL encoding of
raw strings input as GET/POST parameters in localized character encodings, to make sure
Tomcat does the right character conversions for these parameters. (The %xx URL decoding
code in parseParameters() in Tomcat 4 beta 3/4 works fine, but the byte->char/char->byte
code drops some characters.) But there is no way to change Jasper's default compilation
encoding, except by modifying its code.



Re: [Proposal] Default Encoding option for JSP/Tomcat in server.xml or web.xml

Posted by "Craig R. McClanahan" <cr...@apache.org>.

On Sun, 13 May 2001, Alec Yu wrote:

> 
> The servlet/JSP specifications made me feel that:
> they only aimed at L10N problems, not I18N problems.
> 

I can understand your concerns.  However, the correct forum for addressing
them is the spec feedback addresses, rather than here:

  servletapi-feedback@eng.sun.com

  jsp-spec-comments@eng.sun.com

Tomcat implements the specifications -- it does not define them.  Both
specs are in "Proposed Final Draft 2" state right now, so it is getting
close to being too late to make any changes for the 2.3/1.2 versions, but
you would still want to make your concerns known.

As a member of the expert group that defined these specs (in the Java
Community Process, this is being managed by Java Specification Request
#53), I can tell you that internationalization concerns *were* addressed,
and several adjustments were made to servlet 2.3 and JSP 1.2 in order to
improve I18N support.  Given the way that HTTP is defined, and the
lackluster support for correct implementations of HTTP in many browsers,
there are no perfect answers.

> [snip] I am just curious why the standard specifications cost
> people here so much maintenance time, just because they don't allow
> us to specify default encodings for compilation time, input time and
> runtime once only in a few configuration files, but force us to
> specify them in every page & every piece of servlet code. Meanwhile,
> as our products co-operate with code/pages that come from other
> people, we have to ask their developers: Could you send us a copy
> of the source code/pages? Could you take into account character
> encodings other than the one you use? Could you ......
> 

Sounds like great comments for the spec feedback addresses.

> What kind of I18N solution is this? Sure, UTF-8 greatly eased the
> problems on input & output, but it does not solve the maintenance
> problem with other people's code/pages. And not everybody is willing to
> take UTF-8 as their default encoding, because only a few tools are
> able to edit UTF-8 documents (let's forget M$ FrontPage, which surely
> has poor support for JSPs; Dreamweaver is great, but lacks UTF-8
> support; Amaya has poor DBCS support, not to mention JSP; even among
> plain text editors, there are few supporting UTF-8).
> 

Yep ... there is no perfect solution :-).

> You know, lots of, if not most, JSP pages around the world come with
> no page contentType directives, many servlets do not even specify
> their own character encoding, or do not provide an option in some
> configuration files to do so. The real nightmare is not in our own
> servlets/pages, but in other people's.
> 

IMHO that is because most page and servlet authors haven't given
sufficient consideration to I18N.  The ability to set character encoding,
for example, has been there since the very early days of servlets.

> ps. I am a newbie and do not know how to submit code to Apache
> projects. I installed JAMES 1.2.1 on my personal web site, and found
> that it garbled 8-bit MIME mail headers. I fixed that, and added an
> SMTP AUTH LOGIN function to its SMTP handler (so that you may use a
> matcher to allow mail relay by checking accounts, not by IP). I'd like
> to contribute this feature to JAMES. What should I do, without
> becoming an Apache member?
> 
> 

The guidelines for code contributions to all Jakarta projects can be found
on the Jakarta web site, starting at this page:

	http://jakarta.apache.org/site/getinvolved.html

I would imagine that they will welcome your contributions.

Craig



Re: [Proposal] Default Encoding option for JSP/Tomcat in server.xml or web.xml

Posted by Alec Yu <al...@msa.hinet.net>.
From: "Craig R. McClanahan" <cr...@apache.org>
> Servlet Specification 2.3 (Proposed Final Draft 2), Section 5.4 (p. 44):
> 
>     'The default encoding of a response is "ISO-8859-1"
>     if none has been specified by the servlet programmer.'
I am a servlet programmer too,
so why can't I specify it in the container configuration files...*giggle*

> Providing container-level overrides for this would seem to break the spec,
> and any application that depended on that feature would not be portable
> to other containers.
Suppose we are developing a web product in JSP, targeting 3 markets
(say, Japan, Taiwan & Korea).

Meanwhile, our product co-operates with some other servlet/JSP-based
product(s) from 3rd party vendors.

The concern is:
If there is no way to set a default encoding in web.xml/server.xml or whatever
configuration files the servlet/JSP engine uses, then we have to modify not only
our own code & pages, but also those from other vendors.

Worse still, what about servlets that come without source code?

Let's look at a real example (my personal web site):
Sun's Brazil web server acts as the front-end web server (because it is lightweight and
responds faster), with my own Brazil-to-Tomcat connector (which invokes servlets/JSP pages
via direct Java calls, not via socket connections).

Everything was fine until Jive (a free forum system in JSP & beans) was added to this system.
Wow. Jive can't handle Big5, Shift_JIS, GB2312 or anything else like UTF-8; only ISO 8859-1
works fine.

Hell, should I modify Jive again and again and again, when Jive is updated so often?
What about new custom Jive skins from somewhere around the world?
What about other 3rd party JSP pages?

The servlet/JSP specifications made me feel that:
they only aimed at L10N problems, not I18N problems.

> > This seems to work at first, as long as you don't treat strings read
> > from GET/POST parameters as Unicode strings, because they are NOT
> > VALID UNICODE STRINGS. Web output generated from servlets/JSP pages
> > may be right, simply because contents in these NOT VALID UNICODE
> > STRINGS are converted into bytes again by simply doing char->byte
> > typecasting.
> For GET requests, there are not very many good solutions because the
> request itself does not include information about the character encoding
> that was used on the request URI.
Yes, years ago I read something similar explaining why no standard exists for
determining the encoding of GET parameters.

> Could you point me specifically to the byte->char/char->byte code that you
> are concerned about?
Hmm......Thank you for all the explanations.
Indeed, what I was talking about is not broken.
After following the spec more closely, it's OK now.

> You are obviously free to do this kind of special connector, and/or modify
> Tomcat to meet your needs -- but you're also making yourself dependent on
> conventions that are contrary to the servlet and JSP specifications.  Any
> apps you write that depend on this behavior won't run on any other servers
> that implement the standards.  You might want to look at standards based
> alternatives to at least some of the issues that you have raised.
I am just curious why the standard specifications cost people here so much maintenance time,
just because they don't allow us to specify default encodings for compilation time, input time
and runtime once only in a few configuration files, but force us to specify them in every
page & every piece of servlet code. Meanwhile, as our products co-operate with
code/pages that come from other people, we have to ask their developers:
Could you send us a copy of the source code/pages?
Could you take into account character encodings other than the one you use?
Could you ......

What kind of I18N solution is this?
Sure, UTF-8 greatly eased the problems on input & output, but it does not solve
the maintenance problem with other people's code/pages. And not everybody is willing
to take UTF-8 as their default encoding, because only a few tools are able to
edit UTF-8 documents (let's forget M$ FrontPage, which surely has poor support for JSPs;
Dreamweaver is great, but lacks UTF-8 support; Amaya has poor DBCS support,
not to mention JSP; even among plain text editors, there are few supporting UTF-8).

You know, lots of, if not most, JSP pages around the world come with no page contentType directives,
many servlets do not even specify their own character encoding, or do not provide an option in some
configuration files to do so. The real nightmare is not in our own servlets/pages, but in other people's.

ps.
I am a newbie and do not know how to submit code to Apache projects.
I installed JAMES 1.2.1 on my personal web site, and found that it garbled 8-bit MIME mail headers.
I fixed that, and added an SMTP AUTH LOGIN function to its SMTP handler
(so that you may use a matcher to allow mail relay by checking accounts, not by IP).
I'd like to contribute this feature to JAMES. What should I do, without becoming an Apache member?


Re: [Proposal] Default Encoding option for JSP/Tomcat in server.xml or web.xml

Posted by "Craig R. McClanahan" <cr...@apache.org>.

On Sat, 12 May 2001, Alec Yu wrote:

> I read some of the code in Catalina and Jasper and found the following.
> There is now a setCharacterEncoding() method for servlet requests, but
> I grepped all of the Tomcat code and found no place that calls it. This
> means that, by default, Tomcat uses an encoding of '8859_1'. There is
> no option in server.xml/web.xml that lets a context or container (or
> whatever) use a default encoding other than '8859_1'.
> 

Servlet Specification 2.3 (Proposed Final Draft 2), Section 5.4 (p. 44):

    'The default encoding of a response is "ISO-8859-1"
    if none has been specified by the servlet programmer.'

Providing container-level overrides for this would seem to break the spec,
and any application that depended on that feature would not be portable
to other containers.


> Also, the alternative (JSP compilation) encoding option in conf/web.xml
> for Jasper does not seem to work (at least, it fails for JSP pages in
> Big5 encoding). When there is no '<%@ page contentType="text/html;
> charset=xxx" %>' in a JSP, Jasper uses '8859_1' as the JSP's
> default encoding, oops.
> 

Again, this is a spec requirement.  This time it's JSP 1.2 (Proposed Final
Draft 2), Section 2.10.1 (p. 52):

    'The CHARSET value of contentType is used as default if
    present, or ISO-8859-1 otherwise.'
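
As an illustration of that rule, here is a minimal sketch that extracts the charset from a contentType value and falls back to ISO-8859-1 when none is present. The class and method names are hypothetical; this is not Jasper's actual code, only the rule quoted above expressed with JDK string handling:

```java
public class ContentTypeCharset {

    /** Default named by the spec text quoted above. */
    public static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a content type, or the default if absent. */
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=Big5")); // Big5
        System.out.println(charsetOf("text/html"));               // ISO-8859-1
    }
}
```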

> We are working on a product that deploys JSP pages targeting
> multiple markets: Japan, Taiwan, and probably mainland China. When
> we maintain our JSP pages (which initially show messages in English,
> but should be able to handle input in localized character encodings),
> we do not want to maintain 3 versions of each page that differ only
> in the page directive: '<%@ page
> contentType="text/html; charset=xxx" %>'
> 

In JSP 1.2, there is one new feature that can help in this situation.  You
can set the content type dynamically in a scriptlet or custom tag, as long
as the response has not yet been committed.  See the overall page 
lifecycle discussion in Section 2.7.
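
One way this could be used, sketched with JDK classes only: choose the charset per market at request time and pass the resulting string to response.setContentType() from a scriptlet or custom tag before the response is committed. The locale-to-charset mapping and the class name below are assumptions for illustration, not part of JSP 1.2 or Tomcat:

```java
import java.util.Locale;

public class MarketContentType {

    /** Hypothetical mapping from a market's locale to the charset its pages use. */
    public static String charsetFor(Locale locale) {
        String language = locale.getLanguage();
        if ("ja".equals(language)) {
            return "Shift_JIS";
        }
        if ("zh".equals(language)) {
            // Traditional Chinese for Taiwan, Simplified Chinese otherwise.
            return "TW".equals(locale.getCountry()) ? "Big5" : "GB2312";
        }
        return "ISO-8859-1";
    }

    /** Builds the value to hand to response.setContentType(). */
    public static String contentTypeFor(Locale locale) {
        return "text/html; charset=" + charsetFor(locale);
    }

    public static void main(String[] args) {
        System.out.println(contentTypeFor(Locale.JAPAN));  // text/html; charset=Shift_JIS
        System.out.println(contentTypeFor(Locale.TAIWAN)); // text/html; charset=Big5
        System.out.println(contentTypeFor(Locale.CHINA));  // text/html; charset=GB2312
    }
}
```

With a helper like this, a single JSP source could serve all three markets, provided the market's locale is known before any output is flushed.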

> 
> I also found that Tomcat does a byte->char typecast first and then a
> char->byte typecast back before converting the bytes into a Java
> string. Unfortunately, the character encoding is never changed from
> '8859_1' to a customized one assigned somewhere other than in code.
> 

Are you talking about the output character encoding sent to the browser?  
You can set that (along with the content type) by calling

	response.setContentType("text/html; charset=xxxxx");

as long as this is done before the first buffer-full is flushed.

> This seems to work at first, as long as you don't treat strings read
> from GET/POST parameters as Unicode strings, because they are NOT
> VALID UNICODE STRINGS. Web output generated from servlets/JSP pages
> may be right, simply because contents in these NOT VALID UNICODE
> STRINGS are converted into bytes again by simply doing char->byte
> typecasting.
> 

For GET requests, there are not very many good solutions because the
request itself does not include information about the character encoding
that was used on the request URI.

For POST requests, the request parameters will be parsed in the character
encoding specified by the browser (as part of the content type
header).  If the browser did not specify one, a new feature in Servlet 2.3
lets you call request.setCharacterEncoding() before trying to read any
request parameters, if the app knows what character encoding was used.
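
To see why the charset chosen for parsing matters, the following sketch decodes a form-encoded POST body with java.net.URLDecoder under two different charsets. It is a JDK-only illustration with hypothetical class and method names, not the container's own parameter parser, and it assumes the browser submitted the form as UTF-8:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodyDecoder {

    /** Parses "a=1&b=2" style bodies; for repeated names the last value wins. */
    public static Map<String, String> parse(String body, String charsetName) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : body.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name, charsetName), decode(value, charsetName));
        }
        return params;
    }

    /** Percent-decodes one token, turning the checked exception into an unchecked one. */
    private static String decode(String token, String charsetName) {
        try {
            return URLDecoder.decode(token, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String body = "foo=%E4%B8%AD%E6%96%87&bar=a+b";
        // Decoded with the charset the browser actually used: 2 characters.
        System.out.println(parse(body, "UTF-8").get("foo").length());
        // Decoded with the ISO-8859-1 default: 6 characters, one per byte.
        System.out.println(parse(body, "ISO-8859-1").get("foo").length());
    }
}
```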

> Oops! That goes too far. People can't do
> internationalization/localization with such a "garbage in, garbage out"
> solution. It may look right at both the input and output ends if you
> simply GET/POST something and call out.println(xxx.getParameter("foo")).
> But if you are doing something serious with character encodings other
> than 8859_1 (if Big5, GB2312 and Shift_JIS are only for localization and
> not serious enough, how about the UTF-8 character encoding? Indeed, Tomcat
> garbled GET/POST inputs in UTF-8 encoding), you must handle this
> problem.
> 
> Personally, I wrote my own connector to address this problem. The
> connector works as a bridge from Sun's Brazil web server (a lightweight
> web server in 100% Java). Brazil HTTP request objects are passed directly
> into the connector (rather than via some socket protocol), so
> the connector can configure servlets/JSP pages to use a default
> encoding given by properties set in the Brazil configuration file. It
> also checks the URL encoding of raw strings input as GET/POST
> parameters in localized character encodings, to make sure Tomcat
> does the right character conversions for these parameters. (The %xx URL
> decoding code in parseParameters() in Tomcat 4 beta 3/4 works fine,
> but the byte->char/char->byte code drops some characters.) But there is
> no way to change Jasper's default compilation encoding, except by
> modifying its code.
> 

Could you point me specifically to the byte->char/char->byte code that you
are concerned about?

You are obviously free to do this kind of special connector, and/or modify
Tomcat to meet your needs -- but you're also making yourself dependent on
conventions that are contrary to the servlet and JSP specifications.  Any
apps you write that depend on this behavior won't run on any other servers
that implement the standards.  You might want to look at standards based
alternatives to at least some of the issues that you have raised.

Craig McClanahan