Posted to dev@struts.apache.org by Paul Darling <pa...@tumbleweed.com> on 2000/12/28 23:57:07 UTC

Struts handling of multi-lingual input/output strings

I'm new to the list, so my apologies if there has already been a discussion
of this topic.

I'd like to understand how Struts handles different character encodings that
may be used by a Web browser.  

When Struts calls setter methods to set string values, I assume that the
strings will in all cases be encoded as Unicode strings.  Is this correct?
This implies that Struts converts to Unicode from whatever encoding is used
to input string values in the form presented to the browser user?  I can't
find any evidence of this in the Struts code.

Doesn't Struts use the pageEncoding value of the .jsp file to determine the
character encoding of input strings, and also convert output strings from
Unicode to the encoding specified in the form used in a response?

Thanks for any help,

Paul Darling

Re: Struts handling of multi-lingual input/output strings

Posted by "Craig R. McClanahan" <Cr...@eng.sun.com>.

Michael Westbay wrote:

> Darling-san wrote:
>
> > I'd like to understand how Struts handles different character encodings that
> > may be used by a Web browser.
>
> This is something that I've been wondering about myself.
>

As one of the authors of a servlet container (Tomcat :-), I can offer some
insight into what really happens.

>
> > When Struts calls setter methods to set string values, I assume that the
> > strings will in all cases be encoded as Unicode strings.  Is this correct?
>
> That was my assumption as well.  But after experimenting a bit, I've found
> that it may not be the case.  Strings taken from properties files (and within
> jsp pages for that matter) in the system's default character set (EUC-JP in
> my  case) get passed through fine.  Doing a native2ascii on the properties and
> using them doesn't.  Using Shift_JIS encoding in .jsp and/or properties also
> doesn't pass through correctly.
>

It is correct to say that, by the time property setters are called, we are
talking about Unicode strings.  That is because Java uses Unicode internally
for *all* strings.

The issue, though, is "what conversion is used to convert request parameters
into Unicode strings?"  Unfortunately, the answer varies depending on
circumstances, and is not always satisfactory.

If you are doing a POST request, the Reader used to process the request
parameters uses the character encoding specified as part of the
"Content-Type" header, if there is one.  If not, the platform default (often
ISO-8859-1) is used.

If you are doing a GET request, there is no mechanism within the HTTP protocol
specification to indicate what character encoding should be used.  :-(
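
To make the POST case concrete, here is a minimal sketch in plain Java (no
servlet API) of the decision described above: take the charset from the
Content-Type header if one is given, otherwise fall back to a default, and
use it to turn the percent-encoded parameter bytes into a String.  The class
and method names (ParamDecodingSketch, charsetOf, decodeParam) are
hypothetical, and this is not Tomcat's actual code.  ISO-8859-1 is assumed as
the fallback, and UTF-8 stands in for EUC-JP only because every JVM is
guaranteed to support it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ParamDecodingSketch {

    // Assumed fallback when the request does not name a charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Pull the charset out of a header value such as
    // "application/x-www-form-urlencoded; charset=UTF-8".
    static String charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    return p.substring(8).trim();
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    // Turn one percent-encoded parameter value into a Java (Unicode) String.
    static String decodeParam(String raw, String contentType)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, charsetOf(contentType));
    }

    public static void main(String[] args) throws Exception {
        // Two Japanese characters, sent by the browser as six UTF-8 bytes.
        String raw = "%E6%97%A5%E6%9C%AC";

        String declared = decodeParam(raw,
                "application/x-www-form-urlencoded; charset=UTF-8");
        String undeclared = decodeParam(raw,
                "application/x-www-form-urlencoded");

        System.out.println(declared.length());   // 2: decoded as intended
        System.out.println(undeclared.length()); // 6: one garbled char per byte
    }
}
```

Both results are perfectly valid Unicode strings; only the first one is the
text the user typed.  A GET request has no header to consult, so in this
sketch it would always take the fallback branch.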

In the servlet 2.3 specification, a new method was added to the ServletRequest
interface -- setCharacterEncoding().  This can be used by a servlet that wants
to choose what character encoding to employ, no matter what the request says
or does not say.  As above, it will only influence what happens on a POST.
And, you have to call it before any of the getParameter() family of methods
are called.
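
The ordering rule can be illustrated with a small stand-in for a request
object.  This is a hypothetical model, not the servlet API and not Tomcat's
implementation: it assumes that parameters are parsed lazily on the first
getParameter() call and then cached, which is one way to see why a later
setCharacterEncoding() would have no effect.  In a real Servlet 2.3 container
the corresponding calls would be made on the ServletRequest itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a POST request; not a real servlet class.
public class LazyRequestSketch {

    private final String body;               // raw form body, "a=1&b=2" style
    private String encoding = "ISO-8859-1";  // assumed default
    private Map<String, String> params;      // stays null until first parsed

    public LazyRequestSketch(String body) {
        this.body = body;
    }

    // Only takes effect while the parameters have not been parsed yet.
    public void setCharacterEncoding(String enc) {
        if (params == null) {
            encoding = enc;
        }
    }

    // Parses the whole body on first use, then serves from the cache.
    public String getParameter(String name)
            throws UnsupportedEncodingException {
        if (params == null) {
            params = new HashMap<String, String>();
            for (String pair : body.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // skip malformed pairs in this sketch
                }
                params.put(URLDecoder.decode(pair.substring(0, eq), encoding),
                           URLDecoder.decode(pair.substring(eq + 1), encoding));
            }
        }
        return params.get(name);
    }

    public static void main(String[] args) throws Exception {
        LazyRequestSketch early = new LazyRequestSketch("name=%E6%97%A5");
        early.setCharacterEncoding("UTF-8");   // before any getParameter()
        System.out.println(early.getParameter("name").length()); // 1

        LazyRequestSketch late = new LazyRequestSketch("name=%E6%97%A5");
        late.getParameter("name");             // parsing happens here
        late.setCharacterEncoding("UTF-8");    // too late, ignored
        System.out.println(late.getParameter("name").length());  // 3
    }
}
```

The "early" request yields the single intended character; the "late" one
yields three garbled characters because the bytes were already decoded with
the default before the encoding was changed.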

>
> What I'm not sure about, though, is if the Reader used in getting the
> properties is making the conversion to Unicode or not.  I was under the
> impression that it was simply a ResourceBundle underneath - which has always
> worked properly with native2ascii-ed files before.  This is an area that I
> keep wanting to look into further - if I can find the time.
>

For properties files, this is correct -- you need to run native2ascii first to
make sure that they are read correctly.
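
As a rough illustration of why native2ascii matters, the sketch below loads
the same two-character value twice with java.util.Properties: once written as
backslash-u escapes (the form native2ascii produces) and once as raw
multi-byte text.  Properties.load(InputStream) reads its input as ISO-8859-1,
so only the escaped form survives.  The class name and the "greeting" key are
made up for the example, and UTF-8 is used for the raw bytes only because it
is always available; raw EUC-JP or Shift_JIS bytes would be misread in the
same way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class EscapedPropertiesSketch {

    // Load properties from raw file bytes, the way a .properties file on
    // disk is read: bytes are interpreted as ISO-8859-1 and backslash-u
    // escapes are expanded into the characters they name.
    static Properties load(byte[] fileBytes) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(fileBytes));
        return props;
    }

    public static void main(String[] args) throws IOException {
        String expected = "\u65E5\u672C"; // two Japanese characters

        // What native2ascii would write: pure ASCII with escapes.
        byte[] escaped = "greeting=\\u65E5\\u672C\n"
                .getBytes(StandardCharsets.ISO_8859_1);

        // The same value saved directly in a multi-byte encoding.
        byte[] raw = ("greeting=" + expected + "\n")
                .getBytes(StandardCharsets.UTF_8);

        String fromEscaped = load(escaped).getProperty("greeting");
        String fromRaw = load(raw).getProperty("greeting");

        System.out.println(fromEscaped.equals(expected)); // true
        System.out.println(fromRaw.length());             // 6, one per byte
    }
}
```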

>
> > Doesn't Struts use the pageEncoding value of the .jsp file to determine the
> > character encoding of input strings, and also convert output strings from
> > Unicode to the encoding specified in the form used in a response?
>
> I experimented a bit, but couldn't find any evidence to support this.  With
> Japanese it's especially difficult, since there are multiple character sets
> possible for the ja_JP locale.  For input strings, I usually use the Japanese
> "autodetect" encoding.  But I haven't found how to specify the output
> encoding.
>

The "pageEncoding" attribute is new in JSP 1.2, and defines the character
encoding of the JSP page itself, as it is read by the compiler.  It has
nothing to do with how an incoming request to that page is interpreted.

>
> With 2.0 servlets, I would just take the output stream and apply an encoding
> to it by hand (defined in a properties file).  But with Struts, it doesn't
> appear that one can change the encoding of the output in mid-stream - so I
> haven't been able to have any control over it.  And, yes, I've tried defining
> the encoding in the xml header at the top of the jsp pages.  That doesn't
> appear to make any difference.
>

You should be able to set the character encoding for your *output* of a JSP page
by saying something like:

    <%@ page ... contentType="text/html;charset=EUC-JP" ... %>

but, again, this has nothing to do with how subsequent *input* is interpreted.
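
What the charset in contentType controls is the last step on the way out:
how the Unicode strings the page produces are turned into bytes.  The sketch
below uses a plain OutputStreamWriter as a stand-in for the response writer
(an assumption for illustration; a container's writer is more involved) and
sticks to charsets every JVM must support, so UTF-8 and UTF-16BE take the
place of EUC-JP here.  The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class OutputEncodingSketch {

    // Write the text through a Writer bound to the given charset and
    // return the bytes that would go out on the wire.
    static byte[] render(String text, String charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, charset);
        writer.write(text);
        writer.close(); // flushes the encoder into the byte stream
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u65E5\u672C"; // two Japanese characters

        System.out.println(render(text, "UTF-8").length);      // 6
        System.out.println(render(text, "UTF-16BE").length);   // 4

        // ISO-8859-1 cannot represent these characters, so the encoder
        // substitutes a question mark for each one.
        byte[] latin1 = render(text, "ISO-8859-1");
        System.out.println(latin1.length);                     // 2
        System.out.println((char) latin1[0]);                  // ?
    }
}
```

The same String produces different bytes under each charset, and a charset
that cannot represent the text loses it outright, which is why the declared
output encoding has to match what the page actually contains.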

>
> I'm really not sure if this is a Struts problem or a Tomcat (or whatever
> application server you're using) problem.
>
> What I'm currently thinking is that I'll have to chain an encoding filter
> to the Writer in order to gain any control over its output.  But if the
> Writer is already doing a conversion, I'm afraid that it'll all get garbled
> in the end.
>
> For the time being, encoding everything (properties and jsps) in EUC works.
> But it's something I'd like to understand better before getting to the point
> where I have to ship something.
>
> --
> Michael Westbay
>

Craig McClanahan

Re: Struts handling of multi-lingual input/output strings

Posted by Michael Westbay <we...@seaple.icc.ne.jp>.

Darling-san wrote:

> I'd like to understand how Struts handles different character encodings that
> may be used by a Web browser.  

This is something that I've been wondering about myself.

> When Struts calls setter methods to set string values, I assume that the
> strings will in all cases be encoded as Unicode strings.  Is this correct?

That was my assumption as well.  But after experimenting a bit, I've found
that it may not be the case.  Strings taken from properties files (and within
jsp pages for that matter) in the system's default character set (EUC-JP in
my  case) get passed through fine.  Doing a native2ascii on the properties and 
using them doesn't.  Using Shift_JIS encoding in .jsp and/or properties also
doesn't pass through correctly.

What I'm not sure about, though, is if the Reader used in getting the
properties is making the conversion to Unicode or not.  I was under the
impression that it was simply a ResourceBundle underneath - which has always
worked properly with native2ascii-ed files before.  This is an area that I
keep wanting to look into further - if I can find the time.

> Doesn't Struts use the pageEncoding value of the .jsp file to determine the
> character encoding of input strings, and also convert output strings from
> Unicode to the encoding specified in the form used in a response?

I experimented a bit, but couldn't find any evidence to support this.  With
Japanese it's especially difficult, since there are multiple character sets
possible for the ja_JP locale.  For input strings, I usually use the Japanese
"autodetect" encoding.  But I haven't found how to specify the output 
encoding.

With 2.0 servlets, I would just take the output stream and apply an encoding
to it by hand (defined in a properties file).  But with Struts, it doesn't
appear that one can change the encoding of the output in mid-stream - so I
haven't been able to have any control over it.  And, yes, I've tried defining
the encoding in the xml header at the top of the jsp pages.  That doesn't
appear to make any difference.

I'm really not sure if this is a Struts problem or a Tomcat (or whatever
application server you're using) problem.

What I'm currently thinking is that I'll have to chain an encoding filter
to the Writer in order to gain any control over its output.  But if the
Writer is already doing a conversion, I'm afraid that it'll all get garbled
in the end.

For the time being, encoding everything (properties and jsps) in EUC works.
But it's something I'd like to understand better before getting to the point
where I have to ship something.

--
Michael Westbay
Work: Beacon-IT http://www.beacon-it.co.jp/
Home:           http://www.seaple.icc.ne.jp/~westbay
Commentary:     http://www.japanesebaseball.com/