Posted to dev@tomcat.apache.org by "Dmitry I. Platonoff" <dp...@descartes.com> on 2000/04/14 21:25:09 UTC

Character encoding problem review

Hello!


WARNING: this is going to be a really large posting :)  It might be boring
for some people, but I wanted it to be coherent and complete -- to make it
easier to understand, and in the hope that other readers will find some
useful concepts or ideas in this review.



There have been a number of discussions lately about problems with
incorrect character encodings. Typically, people face a situation where
the parameters of a servlet (which usually come from a form submitted by a
user) are parsed and built using the wrong character encoding, which
results in a complete and unrecoverable loss of any text information the
user supplied.

Unfortunately, most of these discussions have ended with no result at all.
Yes, people agree that there IS a need for some kind of infrastructure
which would somehow magically handle the problem. But nobody seems to be
able (or wants to be able) to offer a solution to a problem which has
caused real pain for developers all over the non-Roman-speaking world for
years. People are forced to find their own solutions and workarounds again
and again, which costs them time and money, and I wonder how many times
the wheel has been reinvented because of just one design flaw.

My intention is to try to provide an explanation of the problem itself, to
review the existing implementations, and to start a discussion about how
the problem could be solved in the future.


1. INTRODUCTION.

I don't know who invented the ASCII table. But I want these people to be
sorry for what they did. :) The lion's share of all the i18n problems we
have is caused by this perfect example of selfishness and ignorance.

This is supposed to be a joke, and I do understand the conditions and
limitations people had back then, but this is just one more sad example of
what will happen if we forget to think about the interests of others. And
we will have to clean up this particular mess for years still.


2. THE PROBLEM.

The problem appears when we face the need to translate the 8-bit stream of
the user input to unicode characters. All we need to know is which
character encoding to use.


3. THE EXCUSE.

The common excuse is: "we can never know how to determine the encoding, so
we had better just leave things as they are, with no means to help at all".


The first part is true often enough. However, there are ways to guess or
decide which encoding it is. Moreover, nobody is asking for that part to
be implemented. But the infrastructure SHOULD be built aware of the
problem, and should be able to use the encoding once it's known.


4. THEORY.

We can always use one constant encoding or make certain assumptions, such
as:
 - use the encoding specified in the "Content-Type" header entry;
 - use the JVM default encoding;
 - use the encoding we use for output documents;
 - use the encoding we suspect the user to have;
 ...and so on.

4.1. We could be selfish (which implies a constant encoding, set in one
way or another). This works fine in the Western world (I don't want to
blame anybody, I live there myself, so I'm just stating the facts :). And
it works in the rest of the world too, unless we need to build a
multi-language site. Another example is when our single-language site
meets a browser which does not support the encoding we use (but does
support an alternative).

4.2. The charset supplied with the "Content-Type" header would be ideal.
But in real life almost nobody sends it, for which we must blame the
browser developers.

4.3. The use of the JVM default encoding is tricky. It might work for
single-language sites, but it's still the constant-encoding solution whose
problems are described in 4.1. Another issue is that the default JVM
encoding might not be the same as our document encoding. For example, a
Java machine running under Cyrillic Windows has Cp1252 as its default
encoding, which not only has nothing in common with the standard document
encodings of the Cyrillic world, but is even different from the default
encoding of Windows itself. As an option, you can explicitly set the
desired encoding at JVM startup, which is a common workaround for this
particular problem, but it still doesn't change the big picture.
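As a quick illustration of the mismatch, the JVM's default can be
inspected via the "file.encoding" system property (the class name below is
mine, purely for illustration):

```java
public class DefaultEncodingCheck {
    public static void main(String[] args) {
        // The JVM picks this up from the OS at startup; under Cyrillic
        // Windows it may report Cp1252 rather than the Cp1251 used by
        // Windows itself for Cyrillic text.
        System.out.println("JVM default encoding: "
                + System.getProperty("file.encoding"));
        // The startup workaround mentioned above looks like:
        //   java -Dfile.encoding=Cp1251 MyServer
    }
}
```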

4.4. We can usually assume that a well-behaved browser will return form
data in the same encoding in which it received our document. But first,
how do we know which encoding we were using for this browser (unless it
is stored somewhere, e.g. in the session)? And second, in most cases this
is still a constant-encoding solution.

4.5. We might also guess which encoding it is, using additional
information such as the accept-charset and accept-language parameters, the
browser version, the hostname, the port and so on. Some web servers do
this successfully; the famous "Russian Apache" is a good example -- it is
an Apache module with a set of sophisticated filters which take care of
all the charset translations, allowing you to stick to one particular
encoding in the Java part of your application.
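A minimal sketch of such guessing, matching the Accept-Charset header
against the encodings a site can serve (the helper name and the supported
list are made up for illustration; q-values are ignored for brevity):

```java
import java.util.StringTokenizer;

public class CharsetGuess {
    // Pick the first charset from an Accept-Charset header that we know
    // how to handle; fall back to a site default if nothing matches.
    // A real implementation would also honour the "q=" quality values.
    static String pickCharset(String acceptCharset, String[] supported,
                              String fallback) {
        if (acceptCharset != null) {
            StringTokenizer st = new StringTokenizer(acceptCharset, ",");
            while (st.hasMoreTokens()) {
                String token = st.nextToken().trim();
                int semi = token.indexOf(';');
                String name =
                        (semi >= 0 ? token.substring(0, semi) : token).trim();
                for (int i = 0; i < supported.length; i++)
                    if (supported[i].equalsIgnoreCase(name))
                        return supported[i];
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String header = "windows-1251;q=0.8, utf-8";
        String enc = pickCharset(header,
                new String[] {"windows-1251", "KOI8-R"}, "ISO-8859-1");
        System.out.println(enc); // windows-1251
    }
}
```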


The point of all the situations mentioned above is: nobody knows how to
determine the character set for a particular document of a particular site
better than the site developer. And there are enough third-party tools to
ease this task for multi-encoding or multi-language sites, such as the
Russian Apache or the various servlet wrappers and patches you may find,
especially in Eastern newsgroups.

Therefore, the servlet environment by itself should not care about
guessing the encoding. But what it should do in the first place is make
proper use of this encoding once it's available, and leave room for the
developer to alter this process.

Do the present implementations allow that? They are not even close, and
the following sections show why.


5. IMPLEMENTATIONS.

A study performed by me and Eugen Kuleshov has shown some flaws in the
current servlet implementations. We used the sources of Tomcat build 3.1
beta 1 with the servlets 2.2 sources included. We also took a look at
JServ and Resin for comparison, but I'd like to review Tomcat first,
since, as I assume, it's supposed to be the reference implementation.

5.1. There are parameter parsers in javax.servlet.http.HttpUtils, such as
parseQueryString() and parsePostData().

These parsers do their job with no regard to the character encoding
whatsoever. Since char operations are used, the result will be translated
according to the JVM default encoding. But as I mentioned before, this is
only valid if the document encoding matches the JVM default, which is
often not the case.

Moreover, the POST data stream parser converts the stream data explicitly
using the ISO8859-1 encoding. Normally this should do no harm, since no
high-ASCII characters are allowed in the data stream, but unfortunately
certain browsers leave raw characters in URL-encoded strings (in
particular, Opera 3.x had this bug), and the use of one static encoding
would ruin them for sure.
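Incidentally, the fact that ISO8859-1 maps every byte 0x00-0xFF to exactly
one character is also what makes the usual recovery trick work: re-encode
the garbled string back to bytes and decode them with the real charset. A
sketch (Cp1251 is just an example):

```java
import java.io.UnsupportedEncodingException;

public class RecodeDemo {
    public static void main(String[] args)
            throws UnsupportedEncodingException {
        // What the browser actually sent: Cyrillic text as Cp1251 bytes.
        byte[] formBytes = "привет".getBytes("Cp1251");
        // What a parser hard-wired to ISO-8859-1 hands to the servlet:
        String garbled = new String(formBytes, "ISO-8859-1");
        // Because ISO-8859-1 is byte-preserving, we can undo the damage:
        String repaired =
                new String(garbled.getBytes("ISO-8859-1"), "Cp1251");
        System.out.println(repaired); // привет
    }
}
```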

5.2. Request header lines in Tomcat are read and converted using the
ISO8859-1 encoding explicitly. This would again ruin the query string,
although I'm not so concerned about that, since there should not be any
raw characters to ruin. The other thing which might be a problem is
cookies. I'm not quite sure whether national characters are allowed in
cookies -- they ought to be URL-encoded -- but there will certainly be a
problem: any national characters will be spoiled, and besides, there's
simply no code in Tomcat to decode them -- this can only be done manually.
BTW, JServ has this code, and cookies are decoded properly. But again,
with no regard to the character encoding.

5.3. Tomcat is somewhat aware of the charset's existence; at least it uses
the encoding supplied with the "Content-Type" header in the
req.getReader()/res.getWriter() methods. But why not use it in the
parameter parsers as well?

5.4. The extraction of the charset seems to be a mystery too. JServ
returns the actual charset or ISO8859-1, while Tomcat returns the charset
or null. Personally, I like null better -- it's more honest :) -- although
the API spec says it should be ISO. And of course the same encoding is
used for the reader/writer creation.

Why ISO, why not the default encoding? It would at least solve some
troubles for servers which use a single encoding. Tell me, why should I
have the reader created with the ISO charset if someone's stupid browser
was unable to specify the proper encoding? Of course, I can do it
manually, by using
       new InputStreamReader(req.getInputStream(), myEncoding);
but what's the use of getReader() then?
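What I would expect getReader() to do internally is something like the
following fallback-aware helper (the names and the site-default parameter
are mine, not part of any API):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

public class ReaderFactory {
    // Use the charset from the Content-Type header when the browser
    // supplied one; otherwise fall back to a site-wide default instead
    // of hardwiring ISO-8859-1.
    static Reader readerFor(InputStream in, String headerCharset,
                            String siteDefault)
            throws UnsupportedEncodingException {
        String enc = (headerCharset != null) ? headerCharset : siteDefault;
        return new InputStreamReader(in, enc);
    }

    public static void main(String[] args) throws IOException {
        // Simulate a request body in Cp1251 with no charset in the header.
        byte[] body = "привет".getBytes("Cp1251");
        BufferedReader r = new BufferedReader(
                readerFor(new ByteArrayInputStream(body), null, "Cp1251"));
        System.out.println(r.readLine()); // привет
    }
}
```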


I'm not trying to argue here, I just want to understand -- maybe there are
reasons for this. Or it's just a bad design.


6. PROPOSALS.

What a servlet request really needs is a setCharacterEncoding() concept.
It might be used by the server itself, by any filters (if those are
implemented in the future), or by a custom servlet superclass -- it
doesn't matter. Since the parameters are only parsed on demand, there are
plenty of opportunities to determine and set the right encoding before
they are parsed.
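To make the idea concrete, here is a rough sketch of how such a request
object could defer parsing until the encoding is known. All names here
are illustrative -- this is not a real API, just the shape I have in mind:

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Hashtable;
import java.util.StringTokenizer;

public class EncodedRequest {
    private final byte[] rawQuery;          // undecoded query/POST bytes
    private String encoding = "ISO-8859-1"; // the spec's default
    private Hashtable params;               // parsed lazily

    public EncodedRequest(byte[] rawQuery) { this.rawQuery = rawQuery; }

    // Has effect only if called before the first getParameter().
    public void setCharacterEncoding(String enc) {
        if (params != null)
            throw new IllegalStateException("parameters already parsed");
        encoding = enc;
    }

    public String getParameter(String name)
            throws UnsupportedEncodingException {
        if (params == null) parse();
        return (String) params.get(name);
    }

    private void parse() throws UnsupportedEncodingException {
        params = new Hashtable();
        // ISO-8859-1 preserves every byte, so this step is lossless.
        String q = new String(rawQuery, "ISO-8859-1");
        StringTokenizer pairs = new StringTokenizer(q, "&");
        while (pairs.hasMoreTokens()) {
            String pair = pairs.nextToken();
            int eq = pair.indexOf('=');
            if (eq < 0) continue;
            params.put(decode(pair.substring(0, eq)),
                       decode(pair.substring(eq + 1)));
        }
    }

    // URL-decode into bytes first, then convert with the chosen encoding.
    private String decode(String s) throws UnsupportedEncodingException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') buf.write(' ');
            else if (c == '%' && i + 2 < s.length()) {
                buf.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else buf.write(c);
        }
        return buf.toString(encoding);
    }

    public static void main(String[] args) throws Exception {
        // "text=привет" URL-encoded with Cp1251 bytes.
        byte[] raw = "text=%EF%F0%E8%E2%E5%F2".getBytes("ISO-8859-1");
        EncodedRequest req = new EncodedRequest(raw);
        req.setCharacterEncoding("Cp1251"); // set before the first access
        System.out.println(req.getParameter("text")); // привет
    }
}
```

Note that the parsing happens only on the first getParameter() call, so
any server code, filter or servlet superclass that knows better has a
chance to call setCharacterEncoding() first.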

Afterwards, if such an encoding is set, all the data parsers should be
using it for character translation -- for the query string, POST data
stream, cookies, etc.

There was a thread at the end of March on the servlet-interest list, with
a discussion of this problem between Jason Hunter and Vyacheslav Pedak. It
ended with nothing, as usual -- everybody agreed that it is a commonly
recognized problem, and that there is a solution: convert the data into a
byte array and parse it from there. BTW, this approach also appears to be
more efficient in terms of speed and resources than the existing one.

What I don't understand is why everyone should "build a little
infrastructure to do it yourself" for several years straight, while the
problem is well known (and causes a LOT of pain) and the solution is well
proven. Isn't that the same as selling cars with no finish, just raw
metal, but including a can of paint, saying that the dealer doesn't know
which color your wife's favourite dress is (and considering that there's
only one choice of paint anyway :).

I don't want to paint, and I can handle the color, I just want to drive...
There must be an infrastructure and the proper implementation in place.
Nobody's asking for a "make everyone happy" solution. Those who don't want
to be happy may still go their way.

If a sample or test implementation is needed, there are a number of
request wrapper and parameter manager classes which have been created for
several open projects and have proven themselves to be a reliable
solution. We can take a look at their source code here as a subject for
further discussion.


Please excuse me for the size of this posting. But at least you can
imagine how sick I am of this problem :)


Sincerely,
Dmitry I. Platonoff (dplatonoff@descartes.com)

------------------------------------------
Software Engineer -- Core Services Group
Descartes Systems Group Inc.
(519) 746-6114 x2219
http://www.descartes.com/