Posted to dev@tomcat.apache.org by Adalbert Wysocki <wa...@imediation.com> on 2001/02/14 15:26:19 UTC

RE: charset used for parameters decoding on HTTP request Tomcat3.x,4

> > You will still need to fix the actual parameter parsing routine to delay
> > applying the encoding until the name and parameter are parsed out of the
> > input stream...
> 
> Yes, most of this is already done. It also has a very nice performance
> implication - since the String is converted and allocated only when and if
> it's needed. 
> 
> The only missing part is the "internationalization" module that detects
> the encoding ( charset and accept-language parsing doesn't look good
> either in the current code ), and putting the pieces together.

The problem is that browsers do not send the charset used to encode the
form's parameters; they send the request with the Content-Type header
application/x-www-form-urlencoded. The charset should follow the encoding
type, e.g. "application/x-www-form-urlencoded; charset=UTF8", but in most
cases it does not.
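
To make the header format concrete, here is a minimal sketch of pulling the charset parameter out of a Content-Type value. The class and method names (ContentTypeCharset, charsetOf) are hypothetical and are not taken from the Tomcat code; the parsing is deliberately simple and does not try to handle every legal header syntax.

```java
import java.util.Locale;

final class ContentTypeCharset {
    /**
     * Returns the value of the "charset" parameter of a Content-Type
     * header value, or null when the header carries no such parameter
     * (which, as noted above, is the common case for form posts).
     */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parts[0] is the media type; the rest are "name=value" parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }
}
```

A null result means the container has to fall back to some other policy.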

Re: charset used for parameters decoding on HTTP request Tomcat3.x,4

Posted by Kazuhiro Kazama <ka...@ingrid.org>.
From: Hans Bergsten <ha...@gefionsoftware.com>
Subject: Re: charset used for parameters decoding on HTTP request Tomcat3.x,4
Date: Wed, 14 Feb 2001 11:47:17 -0800
Message-ID: <3A...@gefionsoftware.com>
> I'm afraid I have to -1 this proposal. Sure, it may be a nice feature but it's
> not defined by Servlet 2.2. And, for better or for worse, TC 3.x is the
> Reference 
> Implementation for Servlet 2.2. If we add this behavior to TC 3.x, a servlet
> that takes advantage of it will not be portable to other spec compliant 2.2
> containers.

Agreed.

Some vendors have surely already introduced their own encoding detection
methods, like the ones Costin mentioned. But the details of those methods
are not published, and that has caused breakage in complicated environments.

Servlet 2.3 will introduce a setCharacterEncoding() method. It is
simple, but I think it is a good solution.

Although some i18n problems are solved in Servlet 2.3 and JSP 1.2, it
is inappropriate to introduce a new spec. I (and perhaps all Japanese
users) hope to move to Servlet 2.3 and JSP 1.2. It would be better to
use the Servlet 2.3 spec in Tomcat 3.3 ... Does that exceed the scope
of Tomcat 3.3?

From: Adalbert Wysocki <wa...@imediation.com>
Subject: RE: charset used for parameters decoding on HTTP request Tomcat3.x,4
Date: Wed, 14 Feb 2001 14:26:19 -0000
Message-ID: <9B...@PARSV011>
>  * we suppose that the request's parameters encoding is the one used for the
> response to this request content encoding. If the servlet processing
> generates a result page encoded with Shift_JIS charset, it is reasonable to
> suppose that the incoming form data used for the page generation is encoded
> with the Shift_JIS charset.

There is an exception. In Japan, some systems sometimes accept another
charset, because the JIS character set can be encoded in ISO-2022-JP,
EUC-JP and Shift_JIS, and user-defined HTML forms may be encoded in
another charset. In this case, they use the "JISAutoDetect" converter,
which automatically recognizes the JIS variant character encodings.
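
As a rough illustration of that approach, the sketch below decodes raw bytes with the "JISAutoDetect" charset when the running JVM provides it. The class and method names are hypothetical, and the fallback charset argument is an assumption added here because not every Java runtime ships the extended charsets.

```java
import java.nio.charset.Charset;

final class JisDetect {
    /**
     * Decodes raw form bytes with the JISAutoDetect converter when this
     * JVM supports it; otherwise decodes with the given fallback charset
     * (the fallback is an assumption of this sketch).
     */
    static String decodeJapanese(byte[] raw, String fallbackCharset) {
        String name = Charset.isSupported("JISAutoDetect")
                ? "JISAutoDetect"
                : fallbackCharset;
        return new String(raw, Charset.forName(name));
    }
}
```

JISAutoDetect is a decode-only converter, so it can be used for reading request data but not for writing the response.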

From: Adalbert Wysocki <wa...@imediation.com>
Subject: charset used for parameters decoding on HTTP request Tomcat3.x,4
Date: Mon, 12 Feb 2001 18:00:14 -0000
Message-ID: <9B...@PARSV011>
> NB: A solution would be to overwrite the system property "file.encoding" on
> the command line. But on exotic platforms (such as Korean), overwriting the

In Japan, another solution is used:

    s = new String(s.getBytes("iso-8859-1"), "Shift_JIS");

This method is dirty, but it doesn't change Java's default character
encoding. It also works on a Servlet 2.3 based container, because
Servlet 2.3 defines the default value as "iso-8859-1".
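
The reason the one-liner works is that ISO-8859-1 maps every byte value to exactly one character, so getBytes("iso-8859-1") gives back the original request bytes unchanged. A small sketch of the same trick as a reusable helper follows; the class and method names are hypothetical, and it only helps when the container really did decode the parameter as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1Redecode {
    /**
     * Recovers a parameter that the container decoded as ISO-8859-1
     * although the browser encoded it with actualCharset.
     */
    static String redecode(String misdecoded, String actualCharset) {
        // Step 1: undo the container's decoding to get the raw bytes back.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset the form really used.
        return new String(raw, Charset.forName(actualCharset));
    }
}
```

For example, redecode(request.getParameter("name"), "Shift_JIS") would correspond to the line above, where "name" is just a placeholder parameter name.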

Kazuhiro Kazama (kazama@ingrid.org)		NTT Network Innovation Laboratories

Re: charset used for parameters decoding on HTTP request Tomcat3.x,4

Posted by Hans Bergsten <ha...@gefionsoftware.com>.
> Adalbert Wysocki wrote:
> 
> > > You will still need to fix the actual parameter parsing routine to delay
> > > applying the encoding until the name and parameter are parsed out of the
> > > input stream...
> >
> > Yes, most of this is already done. It also has a very nice performance
> > implication - since the String is converted and allocated only when and if
> > it's needed.
> >
> > The only missing part is the "internationalization" module that detects
> > the encoding ( charset and accept-language parsing doesn't look good
> > either in the current code ), and putting the pieces together.
> 
> The problem is that browsers do not send the charset used to encode the form's
> parameters; they send the request with the Content-Type header
> application/x-www-form-urlencoded. The charset should follow the encoding type,
> e.g. "application/x-www-form-urlencoded; charset=UTF8", but in most cases it
> does not.

Right.

> From my point of view instead of implementing a routine in charge of analysing
> the request header to extract the data's encoding charset (few chances for it
> to really work), it would be better to adopt the following policy:
> 
>  * we suppose that the request's parameters encoding is the one used for the
> response to this request content encoding. If the servlet processing generates
> a result page encoded with Shift_JIS charset, it is reasonable to suppose
> that the incoming form data used for the page generation is encoded with the
> Shift_JIS charset.
> 
>  * While decoding the parameters, instead of supposing that each URL-encoded
> entity (%XX) is a character to be decoded, we append all characters as bytes
> and then we decode the full parameter string using the encoding set on the
> response
> (javax.servlet.http.HttpServletResponse.setCharacterEncoding(String)).
> 
>  * The response encoding must be set on the response object before the first
> call to one of following function (then parameters are parsed):
> 
>     - javax.servlet.http.HttpServletRequest.getParameter(String)
>     - javax.servlet.http.HttpServletRequest.getParameterNames()
>     - javax.servlet.http.HttpServletRequest.getParameterValues(String)
> 
>    If the charset was not set on the response object when one of the functions
> listed above is called then parameters are decoded using the default JVM's
> encoding.
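
The second point of the quoted proposal (collect bytes first, apply the charset once at the end) can be sketched as follows. This is only an illustration of the technique, not the Tomcat or Resin code; the class and method names are hypothetical, and the charset is passed in directly rather than read from a response object.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

final class FormDecoder {
    /**
     * Decodes one URL-encoded name or value. Every %XX escape is
     * appended as a raw byte, and the charset is applied only once,
     * to the complete byte sequence, so multi-byte characters survive.
     */
    static String decode(String encoded, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(encoded.length());
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '+') {
                bytes.write(' ');               // '+' stands for a space
            } else if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at index " + i);
                }
                bytes.write((hi << 4) | lo);    // one escape = one byte
                i += 2;
            } else {
                // Assumption: unescaped characters are plain ASCII.
                bytes.write(c);
            }
        }
        return new String(bytes.toByteArray(), Charset.forName(charsetName));
    }
}
```

Decoding each %XX as a separate character instead would split a multi-byte sequence such as %E6%97%A5 into three unrelated characters.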

I'm afraid I have to -1 this proposal. Sure, it may be a nice feature but it's
not defined by Servlet 2.2. And, for better or for worse, TC 3.x is the
Reference Implementation for Servlet 2.2. If we add this behavior to TC 3.x,
a servlet that takes advantage of it will not be portable to other spec
compliant 2.2 containers.

Servlet 2.3 defines how to deal with this, and this proposal is not in line
with what's in the Servlet 2.3 PFD. It would be a bad idea to add a solution
in the RI for 2.2 that's not compatible with the specified behavior for 2.3.

> NB: This policy is used in Caucho's Resin servlet engine and it works fine.
>     The modifications to the Tomcat code are basic and the risk of impacting
> the core processing is low

Container vendors are free to add features, even though it's probably not a
good idea for them to add features that break spec compliance ;-)

Hans
-- 
Hans Bergsten		hans@gefionsoftware.com
Gefion Software		http://www.gefionsoftware.com
Author of JavaServer Pages (O'Reilly), http://TheJSPBook.com

RE: charset used for parameters decoding on HTTP request Tomcat3.x,4

Posted by cm...@yahoo.com.
> 
> The problem is that browsers do not send the charset used to encode the
> form's parameters; they send the request with the Content-Type header
> application/x-www-form-urlencoded. The charset should follow the encoding
> type, e.g. "application/x-www-form-urlencoded; charset=UTF8", but in most
> cases it does not.

I know. But that's the standard, and we have to follow it first.
If that fails ( and it will - in most browsers that ignore the standards ) -
then we can try workarounds.


> From my point of view instead of implementing a routine in charge of
> analysing the request header to extract the data's encoding charset (few
> chances for it to really work), it would be better to adopt the following
> policy:

There is no "instead" here - in addition to the ";charset=" parameter we
can do many things.


>  * we suppose that the request's parameters encoding is the one used for the
> response to this request content encoding. If the servlet processing
> generates a result page encoded with Shift_JIS charset, it is reasonable to
> suppose that the incoming form data used for the page generation is encoded
> with the Shift_JIS charset.
>...
> (javax.servlet.http.HttpServletResponse.setCharacterEncoding(String)).
>...

That's a good idea - thanks Adalbert. 

There are a few other tricks we can try ( in addition to this one ), and in
time we can hope that browsers will follow the standards.

BTW, another small improvement would be to specify an encoding per
application ( instead of defaulting to the platform or UTF).
And one may guess the charset from the Accept-Language ( in some cases ).
A very common mechanism seems to be a "charset" parameter in the request (
it seems it is possible to do a javascript trick in the page to add
a hidden param with the current browser encoding ).
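
The ordering of these fallbacks could look roughly like the sketch below: the standard ";charset=" value first, then a "charset" request parameter, then a per-application default. All names here are hypothetical, the Accept-Language guess is left out, and the final ISO-8859-1 fallback is an assumption of this sketch rather than something decided in this thread.

```java
import java.nio.charset.Charset;

final class CharsetResolver {
    /**
     * Picks the first usable charset name, in priority order:
     * Content-Type header value, hidden "charset" form parameter,
     * per-application default. Any argument may be null.
     */
    static Charset resolve(String headerCharset, String paramCharset, String appDefault) {
        String[] candidates = { headerCharset, paramCharset, appDefault };
        for (String name : candidates) {
            if (isUsable(name)) {
                return Charset.forName(name);
            }
        }
        // Assumed last resort for this sketch.
        return Charset.forName("ISO-8859-1");
    }

    private static boolean isUsable(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }
}
```

Checking each candidate with Charset.isSupported keeps a bogus value sent by a client from causing an exception later, when the parameters are decoded.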

I'll start working on that in 1-2 weeks, and any suggestions ( like this
one ) will help.

Costin