Posted to users@tomcat.apache.org by lightbulb432 <ve...@hotmail.com> on 2007/07/05 19:22:45 UTC

Character encoding

Why is the URIEncoding attribute specified on the connector rather than on a
host, for example? Does this mean that the number of virtual hosts that can
listen on the same port on the same box are limited by whether they all use
the same encodings in their URIs? Now that I think about it, wouldn't it be
at the context level, not even at the host level?

In Tomcat 6, should the useBodyEncodingForURI be used if not needing
compatibility with 4.1, as the documentation mentions? 

To see if I have things straight, is HttpServletRequest's
get/setCharacterEncoding used for both the request parameters from a GET
request AND the contents of the POST? How are multipart POST requests dealt
with?

And HttpServletResponse's get/setCharacterEncoding is used for the contents
of the response header and the meta tags? Does it also encode the page
content itself? 

What about the encoding of cookies for both incoming requests and outgoing
responses?

Thanks.
-- 
View this message in context: http://www.nabble.com/Character-encoding-tf4031134.html#a11450938
Sent from the Tomcat - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: [OT] Re: Character encoding

Posted by Christopher Schultz <ch...@christopherschultz.net>.

lightbulb,

lightbulb432 wrote:
>> POST requests always use the request's "body" encoding, which is
>> specified in the HTTP header (and can be overridden by using 
>> request.setCharacterEncoding). Some broken clients don't provide 
>> the character encoding of the request, which makes things difficult
>> sometimes.
> 
> What determines what's specified in the HTTP header for the value of the
> encoding?

Well... it's a bit of a chicken-and-egg scenario, since the encoding
specified in the header must match the encoding actually used in the
request. So, you could either decide that the header should match the
content or the content should match the header.

> Is it purely up to the user agent, or can Tomcat provide hints
> based on previous requests how to encode it - or is it something up to the
> end user to set in their browser (in IE, View -> Encoding)?

Typically, the default encoding used by the user-agent will be
locale-specific. For instance, most browsers in the US will use
ISO-8859-1 as the default encoding, or maybe WINDOWS-1252 if you're
unlucky. Ideally, the server should be able to accept all reasonable
encodings. The "Accept-Charset" header sent by the user-agent to the
server indicates the acceptable encodings that should be returned, rated
by acceptability. For instance, my en_US Mozilla Firefox on Windows
sends this Accept-Charset string to servers:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

This indicates that the browser would prefer ISO-8859-1 encoding, but
will also accept UTF-8 as a second choice, but that anything will do
('*') if those two are unavailable.
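To make the q-value ranking concrete, here is a rough sketch in plain Java of parsing such a header into charset-to-q-value preferences. The class and method names are made up for illustration; this is not part of any servlet API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AcceptCharset {
    // Parse an Accept-Charset header into charset -> q-value.
    // Per RFC 2616, a charset listed without a qvalue gets q=1.0.
    public static Map<String, Double> parse(String header) {
        Map<String, Double> prefs = new LinkedHashMap<>();
        for (String part : header.split(",")) {
            String[] tokens = part.trim().split(";");
            double q = 1.0;
            for (int i = 1; i < tokens.length; i++) {
                String t = tokens[i].trim();
                if (t.startsWith("q=")) q = Double.parseDouble(t.substring(2));
            }
            prefs.put(tokens[0].trim(), q);
        }
        return prefs;
    }

    public static void main(String[] args) {
        // The Firefox example from above: ISO-8859-1 has an implicit q=1.0,
        // so it outranks utf-8 and the wildcard at q=0.7.
        System.out.println(parse("ISO-8859-1,utf-8;q=0.7,*;q=0.7"));
    }
}
```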

On HTML <form> elements, you may override the encoding used to send the
data:

<form accept-charset="UTF-8">

The HTML 4 specification says this about the accept-charset attribute:
"The default value for this attribute is the reserved string "UNKNOWN".
User agents may interpret this value as the character encoding that was
used to transmit the document containing this FORM element."
(http://www.w3.org/TR/html4/interact/forms.html#h-17.3)

So, if the server sends a document using UTF-8, it is "polite" for the
user-agent to use that same encoding to respond to the server if the
server hasn't indicated any preference by using the accept-charset
<form> attribute.

> In what cases would you call request.setCharacterEncoding to override the
> value specified by the user agent?

You should only do this when the user-agent does not declare the charset
being used in the body of the request through the Content-Type request
header. You should also only do this when you are relatively confident
that the user-agent is sending the data in the overridden character set.

For instance, if you suspect that most browsers adhere to the W3C's
recommendation above that an UNKNOWN accept-charset implies that the
browser should respond to the server with the same charset as used in
the previous server response (got all that?), and you always use the
same charset to send pages (say, UTF-8), then it is reasonable to
override any unspecified Content-Type encoding with the charset you use
to send pages (UTF-8, in this case).

The HTTP specification has this to say about missing charsets (in
Content-Type headers):
"  The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems."
(http://www.ietf.org/rfc/rfc2616.txt Section 3.7.1)

Basically, this says that a missing charset within a Content-Type header
means that the request should be interpreted as being encoded using
ISO-8859-1 encoding. Pretty simple.
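In plain Java that rule looks like the following sketch (the helper name is made up; the point is just the fallback charset RFC 2616 prescribes):

```java
import java.nio.charset.StandardCharsets;

public class HttpDefaultCharset {
    // When a text/* Content-Type carries no charset parameter,
    // RFC 2616 says to interpret the body as ISO-8859-1.
    public static String decodeWithHttpDefault(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xE9}; // 0xE9 is e-acute in ISO-8859-1
        System.out.println(decodeWithHttpDefault(body));
    }
}
```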

> Shouldn't you trust the user agent rather
> than trying to guess? (Or is this only used in cases where the user agent is
> "broken", like you said - but then how would you know you're dealing with a
> broken client to begin with...aah, complicated!)

You should /always/ respect the charset sent by the client. In fact, the
HTTP spec says so:
"HTTP/1.1 recipients MUST respect the charset label provided by the sender;"
(http://www.ietf.org/rfc/rfc2616.txt Section 3.4.1)

If the client sends the wrong charset, it's their fault that their data
will get all screwed up.

But, if there's no charset, then you should provide your own. The
default charset should be ISO-8859-1. I think Tomcat uses the default
encoding of the JVM if no charset is provided, which is a problem for
folks who set the JVM encoding to UTF-8 for i18n purposes... because
then the default becomes UTF-8 which is incorrect. Fortunately, UTF-8
and ISO-8859-1 are compatible for most common lower ASCII characters.
This has led to a lot of folks thinking that they have their servers
configured correctly because it "looks like it works", but will fail for
things such as accented Latin characters, etc.
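You can see why it "looks like it works" with a few lines of Java: for pure ASCII the two encodings produce identical bytes, and only a non-ASCII character exposes the mismatch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LooksLikeItWorks {
    public static void main(String[] args) {
        // Pure ASCII text encodes to the same bytes either way,
        // which is why a misconfigured server can appear to work.
        byte[] latin1 = "hello".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8   = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(latin1, utf8)); // true

        // An accented Latin character is where it breaks: one byte
        // in ISO-8859-1, two bytes in UTF-8.
        System.out.println("\u00e9".getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println("\u00e9".getBytes(StandardCharsets.UTF_8).length);      // 2
    }
}
```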

> What do you mean by this? Does it mean (pardon the surely messed up use of
> the API below) in your response.addCookie(), you add a cookie where the
> value has cookie.setValue(new String(charByteArray,"UTF-8")) then you read
> it back using responseCookie.getValue().getBytes("UTF-8")? (Where UTF-8 is
> whatever encoding you're using internally in your application.)

Unless you are working with binary data, you shouldn't be using byte
arrays: you should be using Strings. If you are putting binary data into
a cookie, you should probably be encoding it using a reasonable
binary-encoding scheme, such as base64, or even ascii-encoded-binary
(0102030405060708090a0b0c0d0e... that kind of thing... not sure if
that's an official term). HTML is always text, and your headers should
not be in binary. If you check out how the WWW-Authenticate header
works, you'll see that they use base64 to encode the binary data that is
sent over the wire. Then, you don't have to worry about what charset
you're using. The response object already knows what encoding to use and
when.
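A minimal sketch of that base64 approach using the standard java.util.Base64 class (the class name CookieCodec and the sample bytes are invented for illustration):

```java
import java.util.Base64;

public class CookieCodec {
    // Encode binary data as base64 so a cookie value stays plain ASCII
    // and no charset question arises for the header at all.
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static byte[] decode(String cookieValue) {
        return Base64.getDecoder().decode(cookieValue);
    }

    public static void main(String[] args) {
        byte[] secret = {0x01, 0x02, (byte) 0xFF};
        String value = encode(secret); // safe to put in a Set-Cookie header
        System.out.println(value);
        System.out.println(java.util.Arrays.equals(decode(value), secret)); // true
    }
}
```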

Don't forget that the HTTP headers are not part of the request or
response. They are defined to be in ASCII, as far as I can tell. So, if
you're using some odd charset like UTF-16, the headers are still
expressed in good-old single-byte characters, even though the body will
be using two-byte characters.

> Finally, what's the default encoding used by the response when
> response.setCharacterEncoding(myEncoding) isn't called?

That depends. The server will pick an encoding that makes sense. I would
imagine that if the client sent an Accept-Charset header that was
compatible with the default encoding of the JVM, then that charset will
be used. Other than that, I have no idea.

> Am I correct to
> assume that if that default is not the default Java String encoding of
> UTF-16, then you MUST convert all the Strings you've output to that
> encoding? (...because the HTTP header expects whatever the default is, but
> Java is outputting UTF-16 encoded text to the actual response bytes)

Just to note, Java uses UTF-16 internally to store char values. That
doesn't mean that it's the "default encoding" for Java. The default
encoding for the JVM is, in fact, settable by the user. You can read
that value from the system property "file.encoding".
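You can inspect both values yourself; a tiny sketch (note that java.nio.charset.Charset.defaultCharset() is the more reliable way to ask the JVM, as the raw property may not reflect overrides on every JVM):

```java
import java.nio.charset.Charset;

public class JvmDefaultEncoding {
    public static void main(String[] args) {
        // The raw system property, typically derived from the OS locale
        // or set explicitly with -Dfile.encoding=...
        System.out.println(System.getProperty("file.encoding"));

        // The charset the JVM actually uses by default.
        System.out.println(Charset.defaultCharset().name());
    }
}
```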

Tomcat (properly) uses java.io.Writer objects when writing character
data to HTTP responses. Look at the javadoc for
ServletResponse.getWriter():

http://tomcat.apache.org/tomcat-5.5-doc/servletapi/javax/servlet/ServletResponse.html#getWriter()

"Returns a PrintWriter object that can send character text to the
client. The PrintWriter uses the character encoding returned by
getCharacterEncoding(). If the response's character encoding has not
been specified as described in getCharacterEncoding  (i.e., the method
just returns the default value ISO-8859-1), getWriter  updates it to
ISO-8859-1."

So, the servlet specification sets the default character set to
ISO-8859-1, which is inconvenient for users of non-Latin character sets.
That means that, if you want to use something else, you should set the
character encoding /before/ any call to getWriter occurs. I recommend
UTF-8 as I think it should cover all Unicode characters but also uses
fewer bytes when you are sending regular Latin characters, which is nice.
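The byte-count claim is easy to check in plain Java: UTF-8 spends one byte per ASCII character, while UTF-16 spends two (plus a byte-order mark under the generic "UTF-16" name).

```java
import java.nio.charset.StandardCharsets;

public class Utf8VsUtf16 {
    public static void main(String[] args) {
        String latin = "plain Latin text";

        // One byte per character for ASCII under UTF-8...
        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);

        // ...versus two bytes per character, plus a BOM, under UTF-16.
        System.out.println(latin.getBytes(StandardCharsets.UTF_16).length);
    }
}
```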

> P.S. How did you learn all of that?!

Experience. Most of the references I just looked up on the spot, because
I know where to find them. I don't have all those quotes in my brain ;)

-chris




[OT] Re: Character encoding

Posted by lightbulb432 <ve...@hotmail.com>.
That was a really great set of answers, thanks! These follow-ups are somewhat
off-topic to Tomcat, but you really know this stuff well so I hope you don't
mind addressing them:


> POST requests always use the request's "body" encoding, which is
> specified in the HTTP header (and can be overridden by using
> request.setCharacterEncoding). Some broken clients don't provide the
> character encoding of the request, which makes things difficult sometimes.

What determines what's specified in the HTTP header for the value of the
encoding? Is it purely up to the user agent, or can Tomcat provide hints
based on previous requests how to encode it - or is it something up to the
end user to set in their browser (in IE, View -> Encoding)?

In what cases would you call request.setCharacterEncoding to override the
value specified by the user agent? Shouldn't you trust the user agent rather
than trying to guess? (Or is this only used in cases where the user agent is
"broken", like you said - but then how would you know you're dealing with a
broken client to begin with...aah, complicated!)



> You shouldn't have to worry about cookie encoding, since you can always
> call request.getCookies() and get them "correctly" interpreted for you.

What do you mean by this? Does it mean (pardon the surely messed up use of
the API below) in your response.addCookie(), you add a cookie where the
value has cookie.setValue(new String(charByteArray,"UTF-8")) then you read
it back using responseCookie.getValue().getBytes("UTF-8")? (Where UTF-8 is
whatever encoding you're using internally in your application.)


Finally, what's the default encoding used by the response when
response.setCharacterEncoding(myEncoding) isn't called? Am I correct to
assume that if that default is not the default Java String encoding of
UTF-16, then you MUST convert all the Strings you've output to that
encoding? (...because the HTTP header expects whatever the default is, but
Java is outputting UTF-16 encoded text to the actual response bytes)

Am I speaking rubbish here, or am I thinking about these concepts in the
right way?

Thanks a lot.

P.S. How did you learn all of that?!




Christopher Schultz-2 wrote:
> 
> Lightbulb,
> 
> lightbulb432 wrote:
>> Why is the URIEncoding attribute specified on the connector rather than
>> on a
>> host, for example?
> 
> Because the host doesn't handle connections... the connectors do.
> 
>> Does this mean that the number of virtual hosts that can
>> listen on the same port on the same box are limited by whether they all
>> use
>> the same encodings in their URIs?
> 
> Yes, all virtual hosts listening on the same port will have to have the
> same encoding. Fortunately, UTF-8 works for all languages that I know of.
> 
>> Now that I think about it, wouldn't it be
>> at the context level, not even at the host level?
> 
> If you had a connector-per-context, yes, but that's not the case.
> 
>> In Tomcat 6, should the useBodyEncodingForURI be used if not needing
>> compatibility with 4.1, as the documentation mentions? 
> 
> I would highly recommend following that recommendation.
> 
>> To see if I have things straight, is HttpServletRequest's
>> get/setCharacterEncoding used for both the request parameters from a GET
>> request AND the contents of the POST?
> 
> No. GET requests have request parameters encoded as part of the URL,
> which is affected by the <Connector>'s URIEncoding parameter. POST
> requests always use the request's "body" encoding, which is specified in
> the HTTP header (and can be overridden by using
> request.setCharacterEncoding). Some broken clients don't provide the
> character encoding of the request, which makes things difficult sometimes.
> 
>> How are multipart POST requests dealt with?
> 
> Typically, each part of a multipart request contains its own character
> encoding, so a multipart POST would follow the encoding for the part
> you're reading at the time.
> 
>> And HttpServletResponse's get/setCharacterEncoding is used for the
>> contents
>> of the response header and the meta tags?
> 
> Only for the header field, not META tags. If you want to emit META tags,
> you'll have to do them yourself.
> 
>> Does it also encode the page content itself? 
> 
> Nope. If you change the character encoding for a response after the
> response has already had some data written to it, I think you'll send an
> incorrect header. For instance:
> 
> response.setCharacterEncoding("ISO-8859-1");
> PrintWriter out = response.getWriter();
> 
> response.setCharacterEncoding("Big5");
> 
> out.print("abcdef");
> out.flush();
> 
> Your client will not receive a sane response. Setting the character
> encoding only sets the HTTP response header and configures the
> response's Writer, if used, but only /before/ calling getWriter the
> first time.
> 
>> What about the encoding of cookies for both incoming requests and
>> outgoing
>> responses?
> 
> See the HTTP spec, section 4.2 ("Message Headers"). It references RFC
> 822 (ARPA Internet text messages) which does not actually specify a
> character encoding. From what I can see, low ASCII is the encoding used.
> You shouldn't have to worry about cookie encoding, since you can always
> call request.getCookies() and get them "correctly" interpreted for you.
> 
> -chris
> 
> 
>  
> 

-- 
View this message in context: http://www.nabble.com/Character-encoding-tf4031134.html#a11495606
Sent from the Tomcat - User mailing list archive at Nabble.com.




Re: Character encoding

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Lightbulb,

lightbulb432 wrote:
> Why is the URIEncoding attribute specified on the connector rather than on a
> host, for example?

Because the host doesn't handle connections... the connectors do.

> Does this mean that the number of virtual hosts that can
> listen on the same port on the same box are limited by whether they all use
> the same encodings in their URIs?

Yes, all virtual hosts listening on the same port will have to have the
same encoding. Fortunately, UTF-8 works for all languages that I know of.
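For what it's worth, a sketch of how that looks in server.xml; the port and timeout values here are illustrative only, the relevant attribute is URIEncoding:

```xml
<!-- HTTP connector decoding %xx escapes in request URIs as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8" />
```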

> Now that I think about it, wouldn't it be
> at the context level, not even at the host level?

If you had a connector-per-context, yes, but that's not the case.

> In Tomcat 6, should the useBodyEncodingForURI be used if not needing
> compatibility with 4.1, as the documentation mentions? 

I would highly recommend following that recommendation.

> To see if I have things straight, is HttpServletRequest's
> get/setCharacterEncoding used for both the request parameters from a GET
> request AND the contents of the POST?

No. GET requests have request parameters encoded as part of the URL,
which is affected by the <Connector>'s URIEncoding parameter. POST
requests always use the request's "body" encoding, which is specified in
the HTTP header (and can be overridden by using
request.setCharacterEncoding). Some broken clients don't provide the
character encoding of the request, which makes things difficult sometimes.

> How are multipart POST requests dealt with?

Typically, each part of a multipart request contains its own character
encoding, so a multipart POST would follow the encoding for the part
you're reading at the time.

> And HttpServletResponse's get/setCharacterEncoding is used for the contents
> of the response header and the meta tags?

Only for the header field, not META tags. If you want to emit META tags,
you'll have to do them yourself.

> Does it also encode the page content itself? 

Nope. If you change the character encoding for a response after the
response has already had some data written to it, I think you'll send an
incorrect header. For instance:

response.setCharacterEncoding("ISO-8859-1");
PrintWriter out = response.getWriter();

response.setCharacterEncoding("Big5");

out.print("abcdef");
out.flush();

Your client will not receive a sane response. Setting the character
encoding only sets the HTTP response header and configures the
response's Writer, if used, but only /before/ calling getWriter the
first time.

> What about the encoding of cookies for both incoming requests and outgoing
> responses?

See the HTTP spec, section 4.2 ("Message Headers"). It references RFC
822 (ARPA Internet text messages) which does not actually specify a
character encoding. From what I can see, low ASCII is the encoding used.
You shouldn't have to worry about cookie encoding, since you can always
call request.getCookies() and get them "correctly" interpreted for you.

-chris