Posted to dev@tomcat.apache.org by cm...@yahoo.com on 2000/09/24 02:34:06 UTC

Internationalization, Charsets and MessageBytes

Hi,

I need your feedback regarding this very important issue. We discussed
this a few times before, and I think this should be implemented at least in
3.3...

The problem: decoding the request in the right charset.

According to HTTP/1.0 and 1.1, header names are required to be ASCII, but
the values and the URI can be in any charset.

The second problem is that the charset may be specified as part of the
Content-Type header ( and any correct HTTP client should provide it - IIS
and Netscape don't in many versions ), or ( in servlet 2.3 ) it may be
specified by the servlet.
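
As an illustration of the first case, here is a minimal sketch of pulling the
charset parameter out of a Content-Type value. The class and method names
( ContentTypeCharset, charsetOf ) are hypothetical and not part of tomcat, and
the parsing is simplified: it splits on ';' and does not handle a ';' inside a
quoted parameter value.

```java
import java.util.Locale;

// Hypothetical helper, not tomcat code: extracts the charset parameter
// from a Content-Type header value.
final class ContentTypeCharset {

    /** Returns the charset parameter, or null if the header has none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parts[0] is the media type; the remaining parts are parameters.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are case-insensitive.
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-2"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

When the method returns null, the client did not send a charset, which is the
case where heuristics or the servlet would have to supply one.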

In any case, the charset is known _after_ the request is read. As
background: we read bytes from the network, and new String( byte[] ) decodes
them using the server default encoding ( which can be != client encoding ).
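
To make the failure concrete, the sketch below decodes the same two bytes with
two different charsets. The class name DecodeDemo is made up for the example,
and it uses the explicit-charset String constructor from current Java versions
so the result does not depend on the machine it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not tomcat code: the same bytes give different
// Strings depending on the charset used to decode them.
final class DecodeDemo {

    /** Decodes raw bytes with an explicitly chosen charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // U+00E9 ( e with acute accent ) as sent by a UTF-8 client: 0xC3 0xA9.
        byte[] raw = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded with the client's charset: one character, as intended.
        String right = decode(raw, StandardCharsets.UTF_8);

        // Decoded as ISO-8859-1: each byte becomes its own character.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1
        System.out.println(wrong.length()); // 2
    }
}
```

new String( byte[] ) without a charset argument behaves like one of these two
calls, depending on how the server happens to be configured.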

I guess we all know the problem, and I really need to get your feedback
regarding the solution I'm trying to implement in tomcat 3.3.

1. All request components will be read as MessageBytes. No String will be
generated or used during request and header parsing.

2. A CharsetInterceptor will attempt to guess the encoding ( first using
the standard Content-Type, then using various known heuristics - browser
type, accept header, etc).

3. If no encoding is detected, in servlet 2.3 the servlet has a last
chance to set an encoding.

4. The byte->char conversion will happen late, when the servlet calls any
method returning String. MessageBytes.toString() will be called after 
MessageBytes.setCharset() with the right encoding.
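
The late conversion in steps 1 and 4 could look roughly like the sketch below.
LazyBytes is a hypothetical stand-in written for this example, not the actual
MessageBytes class, and the fallback to ISO-8859-1 when no charset was set is
an assumption of the sketch, not something decided above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the idea, not the real MessageBytes:
// keeps a slice of the request buffer and decodes it only on demand.
final class LazyBytes {
    private byte[] buf;
    private int off;
    private int len;
    private Charset charset;
    private String cached;

    /** Points at a slice of the request buffer; no String is created. */
    void setBytes(byte[] buf, int off, int len) {
        this.buf = buf;
        this.off = off;
        this.len = len;
        this.cached = null;
    }

    /** Called once the charset is known ( header, heuristics or servlet ). */
    void setCharset(Charset charset) {
        if (!charset.equals(this.charset)) {
            this.charset = charset;
            this.cached = null; // a different charset invalidates the old String
        }
    }

    /** The byte->char conversion happens here, at most once per value. */
    @Override
    public String toString() {
        if (buf == null) {
            return "";
        }
        if (cached == null) {
            // Assumption of this sketch: fall back to ISO-8859-1.
            Charset cs = (charset != null) ? charset : StandardCharsets.ISO_8859_1;
            cached = new String(buf, off, len, cs);
        }
        return cached;
    }

    /** Resets the object so it can be reused for the next request. */
    void recycle() {
        buf = null;
        off = 0;
        len = 0;
        charset = null;
        cached = null;
    }

    public static void main(String[] args) {
        byte[] line = "name=caf\u00e9".getBytes(StandardCharsets.UTF_8);
        LazyBytes value = new LazyBytes();
        value.setBytes(line, 5, line.length - 5); // the part after "name="
        value.setCharset(StandardCharsets.UTF_8); // known only after parsing
        System.out.println(value.toString().length()); // 4
    }
}
```

A value that the servlet never asks for never reaches toString(), so no String
is allocated for it, and recycle() lets the same object serve the next request.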


It's not simple - a lot of code needs to be changed to implement that (
mostly in helpers and interceptors), but so far I don't know any better
solution. 

The solution has 2 more benefits. It saves memory ( 90% of the bytes will
not be used by the servlet, so no Strings need to be allocated for them, and
all request fields become recyclable components ). It's also faster, as it
avoids conversion overhead for unused fields and can be better tuned ( it's
also fair - the code that calls a method pays the price for it ).


Let me know what you think and if you have any alternative solution. I'm
planning to start working on this after I finish cleaning up core/facade
layers. 

We can delay this until post Tomcat 3.3 if you'd like, but I am hoping
that after 3.3 we won't have major core changes.

 
Costin