Posted to users@tomcat.apache.org by Je suis la poubelle <la...@gmail.com> on 2009/04/02 19:30:28 UTC

Re: Tomcat 5 and UTF-8

On Fri, Mar 27, 2009 at 5:34 PM, Christopher Schultz <
chris@christopherschultz.net> wrote:

>
> Oscar,
>
> On 3/27/2009 10:35 AM, Je suis la poubelle wrote:
> > 1. In those mentioned web pages, I noticed that none of them explicitly
> > specified the following HTML header:
> > <head>
> > <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> > </head>
>
> That's because setting a META tag that doesn't match reality is not
> really a good idea. I can set the charset to Shift_JIS in the META tag
> but it doesn't mean the page is actually in Japanese.


I don't see your point.

Setting the charset/encoding tells the client how the bytes are to be
decoded; it's not just a matter of language.  If setting the charset in a
META tag doesn't mean anything to you, the same argument applies to setting
the charset in the HTTP header.


> > And what if another encoding is specified in HTML header, say
> > ISO-8859-1?  Which one would the browser use in priority?  Nobody knows
> > the answer!
>
> Actually, everybody knows the answer, because it's published in the HTML
> specification: http://www.w3.org/TR/html4/charset.html#h-5.2.2
>
> "
> To sum up, conforming user agents must observe the following priorities
> when determining a document's character encoding (from highest priority
> to lowest):
>
>   1. An HTTP "charset" parameter in a "Content-Type" field.
>   2. A META declaration with "http-equiv" set to "Content-Type" and a
>      value set for "charset".
>   3. The charset attribute set on an element that designates an
>      external resource.
> "
>

Yes, yes, but this is the theoretical answer, not necessarily what happens
in practice.  When there's a bug, there's a bug.
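
For illustration, the priority order quoted above can be written as a small
Java sketch.  The class and method names (CharsetPriority, effectiveCharset)
are hypothetical, not part of any browser or Tomcat API, and the sketch only
models what the HTML 4 text says; it makes no claim about what a particular
browser actually does.

```java
class CharsetPriority {
    /**
     * Returns the first declared charset in the priority order quoted from
     * the HTML 4 recommendation, or null when nothing is declared.
     * Hypothetical helper for illustration only.
     */
    static String effectiveCharset(String httpCharset, String metaCharset, String linkCharset) {
        if (httpCharset != null) return httpCharset; // 1. HTTP Content-Type "charset" parameter
        if (metaCharset != null) return metaCharset; // 2. META http-equiv declaration
        return linkCharset;                          // 3. charset attribute on the referring element
    }

    public static void main(String[] args) {
        // Header and META disagree: the header wins according to the spec.
        System.out.println(effectiveCharset("ISO-8859-1", "UTF-8", null));
        // No header charset: the META declaration is used.
        System.out.println(effectiveCharset(null, "UTF-8", null));
    }
}
```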


>  > That's why I specify the encoding in both places.
>
> While it's not a terrible idea to specify the encoding in both places,
> you should consider the possibility that the META tag can be wrong.


It's not only "not a terrible idea", but a good habit.  It's like the
principle of double-checking: what would be the point of double-checking
if everything always worked as expected?  We double-check because in
practice we make errors.

A good programmer should never leave anything to chance; that's why it's
good to set the charset in both the HTTP header and the HTML header.


> > 2. To make things easier for myself, I always save JSP files in UTF-8
> > encoding, and I always put this header as well:
> > <%@ page pageEncoding="utf-8" %>
> > Now everything's in UTF8 from A to Z.
>
> If you're following guidelines for i18n, you'll put your non-ASCII
> strings into property files and won't have to worry about the encoding
> of the JSP source file.


Yes, but again, you're arguing from a theoretical viewpoint.  What if I
want to create a small, quick JSP just for some tests?  I won't go about
changing everything.  The simplest thing is to change one thing: my JSP
file.

Another situation: what if you don't have full access to all the files?
If Tomcat is on your own computer, it's taken for granted that you can do
anything.  But what if you're developing a JSP site and have it hosted on
some Internet server?  Are you sure you still have full access?  And as
the recommendation you linked says, some servers might not send the HTTP
header.  You'd better also set the charset in the HTML header.

One more example: you're testing one JSP file in your own corner.
Everything works perfectly.  Then you move the file to another server.
In this situation, it's better to have the file self-contained.


> > String sUTF8 = new String(sWrongEncoding.getBytes("iso-8859-1"), "UTF8");
>
> I think that should be "rightString", not "sUTF8", since the String
> object has no inherent encoding.


Not true.  A Java String inherently uses UTF-16.  If you're that picky
about the name, you'd better call it
latin1StringConvertedBackToUTF8BeforeConvertedBackToUTF16 .... but this is
getting ridiculous....
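
To make the disputed one-liner concrete, here is a minimal, self-contained
sketch of the round trip it relies on.  It assumes the text was originally
UTF-8 bytes that were wrongly decoded as ISO-8859-1; the class name
MojibakeRepair and the method name repair are hypothetical, and
StandardCharsets requires Java 7 or later (the original snippet used
charset names as strings instead).

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    /**
     * Undoes a wrong ISO-8859-1 decoding of UTF-8 bytes.
     * ISO-8859-1 maps every byte to exactly one char, so getBytes()
     * recovers the original bytes without loss.
     */
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Andr\u00e9"; // 5 characters, 6 bytes in UTF-8
        // Simulate the bug: UTF-8 bytes decoded with the wrong charset.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());                 // 6
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The trick only works because the wrong charset was ISO-8859-1; with a
charset that has unmapped bytes, the first decoding can already have lost
information.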

Re: Tomcat 5 and UTF-8

Posted by André Warnier <aw...@ice-sa.com>.
Gregor Schneider wrote:
> 
> And it's getting really nuts, when it comes to UTF-8: Talking about
> UTF-8 with or without BOM? Even the specs are not clear about that.
> 
Actually, a UTF-8 stream should /never/ need a BOM, because there is no 
byte-order, UTF-8 being by definition byte-oriented.
The only problem is that, for instance MS-Windows Notepad adds a BOM to 
any text file it saves as UTF-8.  Is anyone surprised ?

Another, related issue is this:
If you edit and save as UTF-8 an html page using, for example, Notepad, 
it will always prefix the file with such a totally superfluous BOM.
If you later serve this page with Apache or Tomcat, to an Internet 
Explorer browser, using no matter which HTTP Content-Type + charset 
header, Internet Explorer will see the BOM and decide that this page is 
encoded in UTF-8, no matter what any meta tag in the page says.
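
As a sketch of how an application could deal with such files before serving
them: the UTF-8 BOM is the three bytes EF BB BF, so it can be detected and
dropped with a few lines of Java.  The class and method names (Utf8Bom,
hasBom, stripBom) are hypothetical, and whether stripping is appropriate
depends on the consumers of the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Bom {
    /** True if the data starts with the UTF-8 encoded BOM (EF BB BF). */
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns the data without a leading UTF-8 BOM; unchanged if there is none. */
    static byte[] stripBom(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] saved = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        System.out.println(hasBom(saved));                                       // true
        System.out.println(new String(stripBom(saved), StandardCharsets.UTF_8)); // <p>
    }
}
```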

> In my opinion, the whole character-set is a pain in the ass:
I agree with that.

> 
> I personally wish IETF came up with some specs saying something like
> "the first n bytes of any stream have to be encoded in ASCII containing
> length and encoding-type of the rest of the stream".
I agree with that too, in general terms.
I believe that any file, any stream, should start with such a prefix, 
indicating at least the file's MIME type, charset and encoding (size may 
be unknown at that point), with a default of "text/plain", Unicode and 
UTF-8.
I also believe there should be an HTTP 2.0 specification, specifying in
clear terms a default Unicode/UTF-8 encoding for URLs, html pages, form 
data submission and so on, and a non-ambiguous way of deviating from that.

The problem is in bringing this about.
> 
> I put that on my wishlist for xmas.
That's nice, but you would have to start by convincing Santa Claus.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Tomcat 5 and UTF-8

Posted by Gregor Schneider <rc...@googlemail.com>.
On Thu, Apr 2, 2009 at 7:30 PM, Je suis la poubelle <la...@gmail.com> wrote:
> On Fri, Mar 27, 2009 at 5:34 PM, Christopher Schultz <
> chris@christopherschultz.net> wrote:
>
>
> Setting charset/encoding is to specify computerized information.  It's
> not just a matter of language.  If setting charset in META tag doesn't mean
> anything to you, the same argument applies to setting charset in HTTP
> header.
>

Well, this is the only argument I can agree with.

But the encoding of HTML/XML is a question of which came first: the
chicken or the egg?

I'll give you an example based on our dreadful experiences with XML-parsing:

Let's say we have a stream looking like this:

<?xml version="1.0" encoding="UTF-8"?>
<foo>bar</foo>

However, the whole stream is actually encoded in some weird
encoding you've never heard of.

See, the parser needs to know about the encoding /in advance/ to be
able to read the encoding from said stream.

See the point?
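
One common way out of this chicken-and-egg situation, sketched below under
a stated assumption: if the real encoding is ASCII-compatible, the XML
declaration itself consists of ASCII characters only, so the first bytes
can be read byte-for-byte and searched for the encoding pseudo-attribute.
The class and method names (XmlEncodingSniffer, declaredEncoding) are
hypothetical; encodings such as UTF-16 or EBCDIC, and streams that start
with a BOM, are not handled here (the XML 1.0 specification discusses
autodetection for those in a non-normative appendix).

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEncodingSniffer {
    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the XML declaration, or null if none is found.
     * Assumes the stream uses an ASCII-compatible encoding, so the declaration
     * survives a byte-for-byte ISO-8859-1 decoding.
     */
    static String declaredEncoding(byte[] head) {
        int len = Math.min(head.length, 200); // the declaration sits at the very start
        String prefix = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<foo>bar</foo>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(doc)); // UTF-8
    }
}
```

The result is still only a claim made by the document; nothing here checks
that the bytes which follow really are in that encoding.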

Actually, it's good practice to declare the encoding, but that's about
it, and the same goes for a META tag.

Talking web, the only thing a parser can rely on is the HTTP header.

And it gets really nuts when it comes to UTF-8: are we talking about
UTF-8 with or without a BOM? Even the specs are not clear about that.

In my opinion, the whole character-set business is a pain in the ass.

I personally wish the IETF came up with some spec saying something like

"the first n bytes of any stream have to be encoded in ASCII, containing
the length and encoding type of the rest of the stream".

I put that on my wishlist for xmas.

Rgds

Gregor
-- 
just because your paranoid, doesn't mean they're not after you...
gpgp-fp: 79A84FA526807026795E4209D3B3FE028B3170B2
gpgp-key available
@ http://pgpkeys.pca.dfn.de:11371
@ http://pgp.mit.edu:11371/



Re: Tomcat 5 and UTF-8

Posted by Markus Schönhaber <to...@list-post.mks-mail.de>.
Christopher Schultz:

> The problem is when the web server sends a response, it sends it using a
> particular character set (let's just say UTF8 for argument's sake). If
> you also report that the character set is UTF8 in the META tags, then
> it's only valid if the client saves the file to the disk with that same
> character encoding. If a different encoding is being used, then the file
> is "lying" about its own encoding.

Agreed, a wrong charset in a meta tag is brain-dead.

Nevertheless, one should keep in mind the charset reported in meta tags
is in practice *not* *only* used when the file is read from disk.
Firefox 3, IE 7 and Opera use the charset from the meta tag if the
server doesn't add a charset to the value of the Content-Type response
header field (Tomcat's DefaultServlet doesn't, for example).

BTW: Although the behaviour of the browsers does comply with the HTML 4
W3C recommendation you cited in an earlier post, they, IMO, violate RFC
2616 (HTTP/1.1) which says in section 3.7.1 that a missing charset in
the Content-Type header field means that the content is encoded in
ISO-8859-1 (for text/* media types, of course) - and the DefaultServlet
violates the RFC too.
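
The rule cited from RFC 2616, section 3.7.1, can be sketched as follows.
The class and method names (ContentTypeCharset, charsetOf) are hypothetical,
the parsing is deliberately simplistic (no escaped quotes, no comments), and
returning null for non-text types is a choice made for this sketch only.

```java
import java.util.Locale;

class ContentTypeCharset {
    /**
     * Returns the charset parameter of a Content-Type value. If it is missing,
     * returns "ISO-8859-1" for text/* media types (the RFC 2616 default)
     * and null for everything else.
     */
    static String charsetOf(String contentType) {
        if (contentType == null) return null;
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) return value;
            }
        }
        boolean isText = parts.length > 0
                && parts[0].trim().toLowerCase(Locale.ROOT).startsWith("text/");
        return isText ? "ISO-8859-1" : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // ISO-8859-1
    }
}
```

A browser that falls back to the META tag when the parameter is missing is
doing something different from this sketch, which is the discrepancy
described above.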

Regards
  mks



Re: Tomcat 5 and UTF-8

Posted by Joseph Millet <jo...@gmail.com>.
Not as unrelated to the topic as my intervention was - sorry, I didn't
see it had already been addressed.

On Mon, Apr 6, 2009 at 10:00 PM, Chris Lenart <cl...@comcast.net> wrote:

> I am using Tomcat 6.0.18 with Eclipse. It says the port is being used.
> Where do I change it?
>
>
>
>

Re: Tomcat 5 and UTF-8

Posted by Joseph Millet <jo...@gmail.com>.
Something more to consider regarding specifying charsets in meta tags:
It's of course fine for a server to send HTTP headers specifying the
charset in which the page is encoded, but when the user saves that web
page to a local drive, there is nothing left but the meta tags to tell
browsers that the page is encoded in UTF-8. So, depending on whether your
app's users are likely to save served content locally, it might be a good
idea to also put this piece of information directly into the HTML doc.

On Mon, Apr 6, 2009 at 8:56 PM, Christopher Schultz <
chris@christopherschultz.net> wrote:

>
> André,
>
> On 4/3/2009 6:43 PM, André Warnier wrote:
>
> > 4) the HTML specs are distinct from the HTTP specs.  [...] It also
> > seems to be superfluous and confusing considering (1) and (2) above.
> > (Like, what if (1) and (4) specify different charsets/encodings ?).
>
> The HTML spec says that the Content-Type wins the argument ;)
>
> > Why hasn't a proposal for HTTP 2.x / HTML 5.x
> > come about, reconciling those aspects and establishing Unicode/UTF-8 as
> > the default (or only) encoding, for URLs as well as content ?
>
> http://dev.w3.org/html5/spec/Overview.html
>
> > 8) What is also missing in my view, is some more general proposal
> > covering the format of text files (and text streams), anywhere.  To
> > alleviate any ambiguity, each text file/stream should contain at least a
> > short prefix indicating its MIME type and its charset/encoding.
>
> I always thought this was a good idea:
> http://en.wikipedia.org/wiki/Resource_fork
>
> > All the above is why I keep on seeing my name echoed back to me as
> > André, even on some well-known supposedly international websites.
>
> :(
>
> -chris

RE: Tomcat 5 and UTF-8

Posted by Chris Lenart <cl...@comcast.net>.
I am using Tomcat 6.0.18 with Eclipse. It says the port is being used.
Where do I change it?




Re: Tomcat 5 and UTF-8

Posted by Christopher Schultz <ch...@christopherschultz.net>.

André,

On 4/3/2009 6:43 PM, André Warnier wrote:

> 4) the HTML specs are distinct from the HTTP specs.  [...] It also
> seems to be superfluous and confusing considering (1) and (2) above.
> (Like, what if (1) and (4) specify different charsets/encodings ?).

The HTML spec says that the Content-Type wins the argument ;)

> Why hasn't a proposal for HTTP 2.x / HTML 5.x
> come about, reconciling those aspects and establishing Unicode/UTF-8 as
> the default (or only) encoding, for URLs as well as content ?

http://dev.w3.org/html5/spec/Overview.html

> 8) What is also missing in my view, is some more general proposal
> covering the format of text files (and text streams), anywhere.  To
> alleviate any ambiguity, each text file/stream should contain at least a
> short prefix indicating its MIME type and its charset/encoding.

I always thought this was a good idea:
http://en.wikipedia.org/wiki/Resource_fork

> All the above is why I keep on seeing my name echoed back to me as
> André, even on some well-known supposedly international websites.

:(

-chris


Re: Tomcat 5 and UTF-8

Posted by André Warnier <aw...@ice-sa.com>.
Hi.

One of my preferred subjects...

1) as per the HTTP specs, the server should send a Content-Type header
along with any response to a browser.  If the response is of the general
type "text", then this Content-Type header should also contain a charset
parameter, indicating the character set and the encoding.
If not indicated, this defaults to iso-8859-1 (which is a charset and an
8-bit encoding).
Apache and Tomcat normally do that, but a badly-written application can
override that and screw things up.  There are also cases where Apache
and Tomcat genuinely do not know, as when picking up a file from disk,
and have to pick either the default iso-8859-1 or what their
configuration specifies as a default.
Of course this is sometimes wrong.

2) also per the HTTP specs, when the server sends a Content-Type header, 
the client (browser) should not second-guess the server. It should 
accept and respect the header in order to interpret the content.
Major discrepancy: all versions of IE which I know of second-guess the
server, in clear violation of the HTTP specs, and make their own 
inspection and heuristic determination of the content received, and 
unfortunately they get it wrong in a number of cases.  Unfortunately 
also, since IE still accounts for over 90% of the browsers used in 
corporate environments, the poor webapp programmer is forced to take 
this bad behaviour into account.

3) If the server sends back a document prefixed by a BOM, then IE also
automatically interprets the document as being Unicode, no matter what
the server (or the document) says.  This is stupid because a UTF-8
encoded document does not need a BOM, considering it is a byte-oriented
encoding anyway, with no possibility of getting a byte-order wrong.
Windows Notepad saves all Unicode documents with a BOM, even when saving
them as UTF-8.

4) the HTML specs are distinct from the HTTP specs.  In the HTML specs, 
there exists a <meta HTTP-equiv="Content-Type" ..> tag, which supposedly 
can contain a charset indication about the content of this HTML page.
I personally find this rather clumsy, because the client has to start 
reading and decoding the HTML document before it can read and interpret 
this header, so its real practical significance is doubtful.  It also 
seems to be superfluous and confusing considering (1) and (2) above. 
(Like, what if (1) and (4) specify different charsets/encodings ?).
But ok, it might be of some use for HTML editors, which could use this 
to try to interpret correctly a document loaded from disk, in which case 
there is no Content-Type sent by a server.

5) both the HTTP specs and the HTML specs are still not entirely
precise or unambiguous about some aspects of the general character set
issues.  For example, the charset is left open when a POST request
contains data encoded as "URL-encoded".  Also, even modern browsers
(including Firefox 3) do not properly specify the encoding of multi-part
POSTs.

6) encoding rules are different for the URLs, for the HTTP headers, and
for the content.  Even a URL has two distinct types of encoding: the
part for the hostname (Punycode, RFC 3492), and the part for the path
and query-string (charset unspecified, percent-encoding).
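
The two kinds of URL encoding mentioned in point 6 can be seen side by side
in a short Java sketch.  The class and method names (UrlPartsEncoding,
encodeHost, encodeQueryValue) are hypothetical.  java.net.IDN (Java 6 and
later) converts a Unicode hostname to its Punycode form; URLEncoder
implements the form encoding used for query values (it turns a space into
"+"), so it is only an approximation for path segments.  Choosing UTF-8
here is an assumption, precisely because the charset is left unspecified.

```java
import java.io.UnsupportedEncodingException;
import java.net.IDN;
import java.net.URLEncoder;

class UrlPartsEncoding {
    /** Hostname part: Unicode labels are converted to ASCII "xn--" labels. */
    static String encodeHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    /** Query value: bytes of the chosen charset (UTF-8 here), percent-encoded. */
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeHost("b\u00fccher.example")); // xn--bcher-kva.example
        System.out.println(encodeQueryValue("Andr\u00e9"));    // Andr%C3%A9
    }
}
```

Had the query value been encoded as ISO-8859-1 instead, the same name would
come out as Andr%E9, and the server has no reliable way to tell which of
the two the client meant.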

7) It never ceases to amaze me, the amount of productive time lost every 
year with character set issues on the web, when Unicode/UTF-8 has been 
around for several years as a charset/encoding covering all languages 
known to man and beyond.  Why hasn't a proposal for HTTP 2.x / HTML 5.x 
come about, reconciling those aspects and establishing Unicode/UTF-8 as 
the default (or only) encoding, for URLs as well as content ?

8) What is also missing in my view, is some more general proposal 
covering the format of text files (and text streams), anywhere.  To 
alleviate any ambiguity, each text file/stream should contain at least a 
short prefix indicating its MIME type and its charset/encoding.

All the above is why I keep on seeing my name echoed back to me as 
André, even on some well-known supposedly international websites.




Re: Tomcat 5 and UTF-8

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Oscar,

On 4/2/2009 1:30 PM, Je suis la poubelle wrote:
> On Fri, Mar 27, 2009 at 5:34 PM, Christopher Schultz <ch...@christopherschultz.net> wrote:
>> While it's not a terrible idea to specify the encoding in both places,
>> you should consider the possibility that the META tag can be wrong.
> 
>      It's not only "not a terrible idea", but a good habit to do so.  Just
> like the principle of "double check": what's the point of double check if
> everything works as expected?  We do "double check" because in practice we
> are subject to errors.

The problem is when the web server sends a response, it sends it using a
particular character set (let's just say UTF8 for argument's sake). If
you also report that the character set is UTF8 in the META tags, then
it's only valid if the client saves the file to the disk with that same
character encoding. If a different encoding is being used, then the file
is "lying" about its own encoding.

> A good programmer should never leave anything to chance, that's why
> it's good to set charset in both HTTP header as well as in HTML header.

I've never used META to set the content type of a web page, and things
seem to be working just fine over here. <shrug>

-chris