Posted to dev@cocoon.apache.org by Antonio Gallardo <ag...@agssa.net> on 2004/05/29 17:11:52 UTC

[RT] About charsets (character encoding) and servlet API

Hi:

I am not sure whether this belongs to Carsten's RT about the new Cocoon
version. I am far from being an authority in this area, but I think it is
time we discussed it:

Introduction
============

It is a fact: the world is moving to UTF-8. Many new development
requirements mention i18n and support for foreign languages, and Cocoon
cannot stay out of this. The current approach to i18n inside Cocoon is
going nowhere. People still have problems when they try to use UTF-8 in
their applications, and this is becoming a serious gap in functionality. A
simple search of the mail archives for the keyword UTF-8 finds 2268 mails!
That number will grow faster than ever if we don't solve this problem.

Note that I am aware of the i18n samples in Cocoon, and that the current
efforts to solve this are being documented at:

http://wiki.cocoondev.org/Wiki.jsp?page=RequestParameterEncoding

But still there is a question in my mind:

Why do we need to make rocket science out of this triviality?

I am also asking myself whether the problem is a bug in Tomcat or lies in
the servlet API we are currently using.

The following lines explain why.

Revisiting the servlet API 2.2
==============================

Final Release: December 17th, 1999.

Cocoon has been using the servlet specification 2.2 for several years (I am
not sure whether since its first days). The 2.2 specification has no clear
policy on how to handle the i18n problem. Instead it leaves the problem to
the servlet developers (in this case, the Cocoon developers). This is why I
am not sure we can blame Tomcat or other servlet containers, and I guess it
is mainly why Tomcat does nothing about it: we use Servlet API 2.2.

Facts:

1- In the Servlet API 2.2, the methods that parse parameter input
(getParameter() etc.) ALWAYS assume that it was sent as ISO-8859-1.

2- ISO-8859-1 is the default encoding of HTTP!

                              - 0 -

So if we send characters in, say, UTF-8, the container creates a String
that holds the correct bytes but was decoded with the wrong charset!

And this is why Cocoon needs a hack (or a "fix") to convert the bytes to a
string using the correct charset. Bruno showed me that we use something
like:

new String(value.getBytes("8859_1"), "utf-8")

Knowing the above facts, let me describe what happens when the browser
sends parameters in UTF-8:

The browser takes the bytes of each character in the encoding of the page
(in this example, UTF-8) and escapes the bytes that cannot be sent
literally as percent-prefixed hexadecimal values.

The server (the servlet container) then interprets these byte values and
always assumes they are 8859-1 bytes! So it creates a Unicode string from
the byte values interpreted as 8859-1. Since the 8859-1 assumption is made
by the container, the Cocoon hack (or "fix") is needed regardless of the
platform we run on.
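
To make the mechanism concrete, here is a minimal sketch in plain Java (no
servlet API involved). The class and method names are made up for this
example, and containerDecode() only models the assumed Servlet 2.2
behaviour described above; it is not taken from any container's source.

```java
import java.nio.charset.StandardCharsets;

final class RecodeTrickDemo {

    // Models the container: raw parameter bytes are always decoded as
    // ISO-8859-1, whatever the browser actually sent (assumption).
    static String containerDecode(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    // The recode trick: recover the original bytes, then decode as UTF-8.
    static String recode(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] sent = original.getBytes(StandardCharsets.UTF_8); // 5 bytes
        String seen = containerDecode(sent); // what getParameter() would return

        System.out.println(seen.length());                 // 5, not 4
        System.out.println(recode(seen).equals(original)); // true
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value to
exactly one character, so getBytes() gives back the bytes the browser sent.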

But this is rocket science! Especially when newer servlet APIs already let
us handle it in a more elegant way....


Moving to a new servlet specification?
======================================

Reading about this, I found:

Since servlet specification 2.3 (Final Release August 13th, 2001), Sun has
started to solve some of the problems related to this topic. This API
provides support for handling foreign-language form submissions. With API
2.3 we can tell the server which character encoding to use for the request,
using the method:

request.setCharacterEncoding(String encoding).

So to retrieve UTF-8 parameters we can simply use:

req.setCharacterEncoding("UTF-8");      // Set the charset to UTF-8
String name = req.getParameter("name"); // Read the parameter
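
As a rough illustration of what changes once the container knows the
request encoding, the sketch below decodes the same percent-encoded value
with two charsets using java.net.URLDecoder. It only approximates what a
container does internally; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class ParameterDecodingDemo {

    // Decodes one percent-encoded form value with the given charset.
    static String decode(String rawValue, String requestEncoding) {
        try {
            return URLDecoder.decode(rawValue, requestEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + requestEncoding, e);
        }
    }

    public static void main(String[] args) {
        String raw = "caf%C3%A9"; // accented "cafe" submitted from a UTF-8 page

        System.out.println(decode(raw, "ISO-8859-1").length()); // 5 (garbled)
        System.out.println(decode(raw, "UTF-8").length());      // 4 (correct)
    }
}
```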

This is great, right? Let's see what we have now....

Servlet API 2.4
===============

Reading on, I found:

Introduced on November 24, 2003
Minimum J2SE required: 1.3

In this API, the ServletResponse interface (and the ServletResponseWrapper)
gains a new method that is interesting for us:

1- setCharacterEncoding(String encoding): Sets the response's character
encoding. This method provides an alternative to passing a charset
parameter to setContentType(String) or passing a Locale to
setLocale(Locale).

With this method, we can avoid setting the charset through a
setContentType("text/html; charset=UTF-8") call.

Servlet 2.4 also introduces a new <locale-encoding-mapping-list> element
in the web.xml deployment descriptor to let the deployer assign
locale-to-charset mappings outside servlet code. It looks like this:

<locale-encoding-mapping-list>
  <locale-encoding-mapping>
    <locale>ja</locale>
    <encoding>Shift_JIS</encoding>
  </locale-encoding-mapping>
  <locale-encoding-mapping>
    <locale>zh_TW</locale>
    <encoding>Big5</encoding>
  </locale-encoding-mapping>
</locale-encoding-mapping-list>

Now within this Web application, any response assigned to the ja locale
uses the Shift_JIS charset, and any assigned to the zh_TW Chinese/Taiwan
locale uses the Big5 charset. These values could later be changed to UTF-8
when it grows more popular among clients. Any locales not mentioned in the
list will use the container-specific defaults as before.

Conclusion
==========

I think most of us are using servlet containers that implement servlet
spec 2.3 or higher. So I think it is time to move to a higher servlet API
spec. I think these small things alone are enough reason.

Please tell me, WDYT?

Best Regards,

Antonio Gallardo

Further reading:

[1] "Servlet 2.3: New features exposed" -
http://www.javaworld.com/javaworld/jw-01-2001/jw-0126-servletapi.html

[2] "Servlet 2.4: What's in store" -
http://www.javaworld.com/javaworld/jw-03-2003/jw-0328-servlet.html


Re: [RT] About charsets (character encoding) and servlet API

Posted by Bruno Dumon <br...@outerthought.org>.
On Sun, 2004-05-30 at 03:08, Pier Fumagalli wrote:
> On 29 May 2004, at 16:11, Antonio Gallardo wrote:
> >
> > I think most of us are using servlet containers with servlet specs 2.3 
> > or
> > superior. In that way, I think it is time to move to a higher servlet 
> > API
> > specs? I think just this little things are enough.
> 
> I've been doing i18n work on Servlets for a _very_ long time and, dude, 
> I've never seen a problem with the API ever...
> 
> Let's split the problem in three parts: headers and body and URLs:

<snip/>

> 
> -----
> BODY:
> -----
> 
> RFC-2616 is _very_ clear at this point, if you don't specify the 
> charset token in the "Content-Type" header, and you specify (or imply) 
> that the body is "text/something" you SHOULD assume that you're 
> receiving / sending text encoded in ISO-8859-1...
> 
> Again, I seriously don't think that servlet containers check for the 
> encoding of the request body when the content type is 
> "application/x-www-form-urlencoded", because I _suppose_ that given 
> that it doesn't start with "text/..." they ignore the whole shabang...
> 
> So, I believe that in some cases, the encoding of parameters returned 
> by servlet containers MIGHT be wrong (but I ain't sure, haven't checked 
> that lately).
> 
> When you send, on the other hand, the servlet API doesn't have much 
> functionality until 2.4 to set the charset encoding of the response, 
> but that _really_ affected only stupid JSPs which were never thought 
> out right anyway...
> 
> In Cocoon (I hope) we should never rely on the "getWriter()" returned 
> by the servlet container but ALWAYS use a "getOutputStream()" and set 
> ALWAYS the content type with the proper "charset" token...
> 
> If we don't we're kinda violating 3.4.1 of RFC-2616 as it says that one 
> SHOULD always put the charset in there (if relevant, of course).

AFAIK we currently don't set it, and that's causing some problems with
certain Tomcat versions, which set it to ISO-8859-1 by default.

> 
> So, the problem is only in reading parameters, and that should be fixed 
> at the servlet container level.

Servlet containers can't do much about it, since browsers don't tell in
which encoding they send their data (I know they should, but it seems they
don't). All browsers seem to keep to the convention of sending it in the
same encoding as the page containing the form.

As Antonio mentioned, with servlet 2.3 it's possible to do something
like

req.setCharacterEncoding("UTF-8");

to make the container decode parameters in the encoding you want. The one
catch is that this must be called before any request parameter is read,
which means somewhere at the beginning of the Cocoon servlet.
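
Why the ordering matters can be shown with a toy model. LazyRequestModel
below is a hypothetical stand-in, not a servlet API class; it simply
assumes the container parses parameters lazily on first access and then
caches the result, so a later encoding change has no effect.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical stand-in for a servlet request; not part of any servlet API.
final class LazyRequestModel {
    private final String rawValue;          // percent-encoded value as received
    private String encoding = "ISO-8859-1"; // assumed default when nothing is set
    private String parsedValue;             // cached after the first read

    LazyRequestModel(String rawValue) {
        this.rawValue = rawValue;
    }

    // Only takes effect while the parameter has not been parsed yet.
    void setCharacterEncoding(String encoding) {
        if (parsedValue == null) {
            this.encoding = encoding;
        }
    }

    String getParameter() {
        if (parsedValue == null) {
            try {
                parsedValue = URLDecoder.decode(rawValue, encoding);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e);
            }
        }
        return parsedValue;
    }

    public static void main(String[] args) {
        LazyRequestModel early = new LazyRequestModel("caf%C3%A9");
        early.setCharacterEncoding("UTF-8"); // set before the first read
        System.out.println(early.getParameter().length()); // 4

        LazyRequestModel late = new LazyRequestModel("caf%C3%A9");
        late.getParameter();                 // first read fixes the decoding
        late.setCharacterEncoding("UTF-8");  // too late, ignored
        System.out.println(late.getParameter().length());  // 5
    }
}
```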

This also means that, if different parts of your app use different
encodings (because they target applications/devices that don't understand
e.g. UTF-8), there's no easy way to change it, except by using the
decode/encode trick. So we'd need to keep that system in place anyhow.

So basically it comes down to:

* what we have in place today works just fine

* by using req.setCharacterEncoding, we would gain some cpu cycles and
some memory by avoiding the need for the recode trick (in most cases).

* but this requires us to set the servlet 2.3 spec as minimum
requirement

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org


Re: [RT] About charsets (character encoding) and servlet API

Posted by Pier Fumagalli <pi...@betaversion.org>.
On 29 May 2004, at 16:11, Antonio Gallardo wrote:
>
> I think most of us are using servlet containers with servlet specs 2.3 
> or
> superior. In that way, I think it is time to move to a higher servlet 
> API
> specs? I think just this little things are enough.

I've been doing i18n work on Servlets for a _very_ long time and, dude, 
I've never seen a problem with the API ever...

Let's split the problem in three parts: headers and body and URLs:

--------
HEADERS:
--------

Now, the HTTP spec defines that a header needs to follow the RFC-822 
section 3.1 specification, therefore (I'm going on memory here, not 
cross checking) the header name must be composed of only a strict 
subset of US-ASCII characters, and the header value can ONLY be made up 
of ISO-8859-1 characters.

No problemo here...

At around page 16 of the RFC-2616 Roy also mentions that IF you want to 
encode something in headers that IS NOT encodable in ISO-8859-1, you 
gotta follow RFC-2047 (Mime Part 3) which defines clearly how such 
values are encoded...
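
For reference, an RFC-2047 encoded-word has the shape
=?charset?encoding?encoded-text?=. The sketch below builds a single
Base64 ("B") encoded-word in UTF-8. It is a minimal illustration with a
made-up class name: it does not split long values into several words (the
RFC limits each word to 75 characters) and does not implement the "Q"
encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class EncodedWordDemo {

    // Builds one RFC 2047 "B" encoded-word for a header value.
    // Sketch only: no splitting of long values into multiple words.
    static String encodeWord(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        // U+65E5 U+672C, the two characters of "Nihon" (Japan)
        System.out.println(encodeWord("\u65e5\u672c")); // =?UTF-8?B?5pel5pys?=
    }
}
```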

Now, when we do a setHeader in the response, or do a getHeader from the 
request, the servlet container SHOULD parse/encode out the values in 
the correct way, although I've never seen any of them doing it (they 
simply ignore the whole shabang and use ISO-8859-1 for both header 
names and values and don't do any additional parsing/encoding).

Bug in the servlet containers...

-----
BODY:
-----

RFC-2616 is _very_ clear at this point, if you don't specify the 
charset token in the "Content-Type" header, and you specify (or imply) 
that the body is "text/something" you SHOULD assume that you're 
receiving / sending text encoded in ISO-8859-1...

Again, I seriously don't think that servlet containers check for the 
encoding of the request body when the content type is 
"application/x-www-form-urlencoded", because I _suppose_ that given 
that it doesn't start with "text/..." they ignore the whole shabang...

So, I believe that in some cases, the encoding of parameters returned 
by servlet containers MIGHT be wrong (but I ain't sure, haven't checked 
that lately).

When you send, on the other hand, the servlet API doesn't have much 
functionality until 2.4 to set the charset encoding of the response, 
but that _really_ affected only stupid JSPs which were never thought 
out right anyway...

In Cocoon (I hope) we should never rely on the "getWriter()" returned 
by the servlet container but ALWAYS use a "getOutputStream()" and set 
ALWAYS the content type with the proper "charset" token...

If we don't we're kinda violating 3.4.1 of RFC-2616 as it says that one 
SHOULD always put the charset in there (if relevant, of course).
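
A small sketch of that discipline, in plain Java: pick one charset, use it
both for the "charset" token and for turning characters into bytes. The
ByteArrayOutputStream merely stands in for the stream a servlet would get
from getOutputStream(); the class and method names are invented for the
example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetOutputDemo {

    // Value for the Content-Type header, always carrying the charset token.
    static String contentType(String mediaType, Charset charset) {
        return mediaType + "; charset=" + charset.name();
    }

    // Encodes the body ourselves instead of trusting a container-made Writer.
    static byte[] encodeBody(String body, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // stand-in stream
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(contentType("text/html", utf8));       // text/html; charset=UTF-8
        System.out.println(encodeBody("caf\u00e9", utf8).length); // 5
    }
}
```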

So, the problem is only in reading parameters, and that should be fixed 
at the servlet container level.

----
URL:
----

URLs are important as sometimes the request parameters are passed as 
query string attached to them...

Initially they were defined on US-ASCII and/or ISO-8859-1 (can't 
remember which one exactly) and that all non-printable characters had 
to be encoded with the usual percent-number-number format...

Great...

Between the W3C and RFC-2718 someone decided (at the end of the whole 
discussion) that URLs, in their internationalizable format only had to 
change in one aspect: the character encoding.

So, a URL nowadays (tested on my girlfriend's Japanese Internet Explorer) 
is a sequence of bytes representing a string encoded in UTF-8, and the 
same rule applies of encoding the characters outside of the 
originally-defined printable ones with the usual percent-number-number 
re-encoding...
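
As an example of that percent-number-number form applied to UTF-8 bytes,
here is a short sketch using java.net.URLEncoder. Note that URLEncoder
implements HTML form encoding (it turns a space into "+"), so it is only
an approximation of URL escaping; the class name is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodingDemo {

    // Percent-encodes the UTF-8 bytes of the value (form-encoding rules).
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // U+65E5 U+672C: three UTF-8 bytes per character
        System.out.println(encodeUtf8("\u65e5\u672c")); // %E6%97%A5%E6%9C%AC
    }
}
```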

Again, I seriously don't think that any servlet container does this 
check, so, if we get wrong request parameters when someone browses in 
Japanese and submits a GET form, it's not our fault...

-----------
CONCLUSION:
-----------

I believe Jon Postel once said "be strict in what you send, be liberal 
in what you accept" and this principle has been forgotten by the 
servlet-container implementors...

We can be strict as much as we can by sending the right stuff (as the 
servlet API allows us to do it by using OutputStream(s) instead of 
Writer), but we cannot be liberal in what we accept as URLs and request 
parameters are already pre-parsed for us into nice unicode-based Java 
String(s).

As far as I can see (and by the "trick" you outlined)

new String(value.getBytes("8859_1"), "utf-8")

servlet containers simply ignore that there's a world out there that 
DOES NOT speak English, and take shortcuts to increase their parsing 
speed...

Unfortunately, there's not much we can do (apart from brutal hacks like 
the one mentioned above) to get parameters from my girlfriend's 
Japanese browser.

One thing we could do, though, is to make sure that the communities 
building our servlet container of choice are aware of those problems, 
so, rather than reinventing hacks in Cocoon, I'd say, post those issues 
as bugs for Tomcat and Jetty and let them sort out the whole mess...

It ain't our fault, and unfortunately, we can properly fix only one 
side of the story, what we send...

	Pier