Posted to dev@tomcat.apache.org by Tony LaPaso <tl...@comcast.net> on 2003/11/11 05:15:56 UTC

TC 5.0.14 Breaks UTF-8 Content via HTTP Header

Hi everyone,

It seems a change in TC v5.0.14 may have broken the serving of UTF-8
documents. Specifically, one of the HTTP headers seems wrong. I'd like to
describe what I'm seeing in TC v5.0.14 compared with v5.0.12.

For both v5.0.12 and v5.0.14 I'm running TC as it comes "out of the box"
i.e., with no changes to the default configurations.

In both cases I tested with four browsers (IE 5, IE 6, Netscape 7.1 and
Firebird 0.7), all on Win 2K.


Here's What I Did
-----------------
In both versions of TC, I added an "em dash" character to the
"/tomcat-docs/cgi-howto.html" documents that come with the TC documentation.
The UTF-8 representation for the "em dash" character is the three bytes
0xE2 0x80 0x94. I also made sure both documents had the following META tag in
their <head>:

<meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>

I then saved the documents as UTF-8 (without a BOM). Finally, I brought the
document into a hex editor to check that the em dash was properly encoded as
three bytes (which it was). This indicated to me that the document was
indeed encoded as UTF-8.
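
For anyone who wants to repeat the hex-editor check without a hex editor, here
is a minimal sketch in plain Java. The class and method names (EmDashUtf8,
utf8Hex) are made up for illustration; the only claim it relies on is that the
em dash is U+2014.

```java
import java.nio.charset.StandardCharsets;

public class EmDashUtf8 {
    // Returns the UTF-8 bytes of the given text as an uppercase hex string.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+2014 is the em dash.
        System.out.println(utf8Hex("\u2014")); // prints E28094
    }
}
```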


Here's What I Saw (TC v5.0.12)
------------------------------
Under TC v5.0.12, everything looked great using all browsers -- the "em
dash" was rendered correctly. I put a sniffer on the HTTP stream. The
v5.0.12 Coyote Connector was sending this HTTP response header:
Content-Type: text/html


Here's What I Saw (TC v5.0.14)
------------------------------
Under TC v5.0.14 the "em dash" character was rendered as *THREE SEPARATE
CHARACTERs* (one for each byte). Moreover, putting a sniffer on the HTTP
stream indicated the following response header was being sent by the v5.0.14
Coyote Connector:
Content-Type: text/html;charset=ISO-8859-1
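
The "three separate characters" symptom can be reproduced outside a browser.
The sketch below (class and method names are hypothetical) decodes the same
three bytes once as UTF-8 and once as ISO-8859-1. Under strict ISO-8859-1 each
byte becomes its own character, which is consistent with what the browsers
showed once the header announced that charset.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodedEmDash {
    // The UTF-8 encoding of the em dash (U+2014).
    static final byte[] EM_DASH_UTF8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x94};

    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // ISO-8859-1 maps every byte to exactly one character (U+0000..U+00FF).
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeUtf8(EM_DASH_UTF8).length());   // prints 1
        System.out.println(decodeLatin1(EM_DASH_UTF8).length()); // prints 3
        for (char c : decodeLatin1(EM_DASH_UTF8).toCharArray()) {
            // Prints U+00E2, U+0080, U+0094 on separate lines.
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```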


Aside
-----
For the heck of it I re-saved the v5.0.14 UTF-8 document with a BOM
(0xEFBBBF). Doing this made IE correctly render it but Netscape and Firebird
still had problems. I'm pretty sure that Unicode says the BOM is optional
anyway.


Conclusion (?)
--------------
It seems that v5.0.14 of the Coyote Connector is incorrectly sending the
wrong response header. I'm not sure what the HTTP spec says *should* be sent
for the header if the document's <head> contains:

<meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>

My guess is that either the response header in v5.0.14 needs to be changed
to:
Content-Type: text/html;charset=UTF-8

or possibly:

Content-Type: text/html

as it was with TC v5.0.12.

Can anyone comment? Is this a TC v5.0.14 bug? It would seem to be.

Thanks,

Tony






---------------------------------------------------------------------
To unsubscribe, e-mail: tomcat-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: tomcat-dev-help@jakarta.apache.org


Re: TC 5.0.14 Breaks UTF-8 Content via HTTP Header

Posted by Nikola Milutinovic <Ni...@ev.co.yu>.
Tony LaPaso wrote:

> Here's What I Did
> -----------------
> In both versions of TC, I added an "em dash" character to the
> "/tomcat-docs/cgi-howto.html" documents that come with the TC documentation.
> The UTF-8 representation for the "em dash" character is the three bytes
> 0xE28094. I also made sure both documents had the following META tag in its
> <head>:
> 
> <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>

This constitutes a correct HTML document, with respect to the actual and 
announced document encoding.

> Here's What I Saw (TC v5.0.14)
> ------------------------------
> Under TC v5.0.14 the "em dash" character was rendered as *THREE SEPARATE
> CHARACTERs* (one for each byte). Moreover, putting a sniffer on the HTTP
> stream indicated the following response header was being sent by the v5.0.14
> Coyote Connector:
> Content-Type: text/html;charset=ISO-8859-1

First of all, was that an HTML page or a JSP? If it was a JSP, then unless you
specify the page encoding in the JSP page directive, Tomcat will, and should,
use the default encoding in the HTTP headers.

Secondly, what is actually sent in the TC 5.0.12 case?

> Conclusion (?)
> --------------
> It seems that v5.0.14 of the Coyote Connector is incorrectly sending the
> wrong response header. I'm not sure what the HTTP spec says *should* be sent
> for the header if the document's <head> contains:
> 
> <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>

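The <meta> tag comes from the HTML specification, not from HTTP. It lets the
page author declare an encoding inside the document, but under HTML 4.01
(section 5.2.2) a charset parameter in the HTTP Content-Type header takes
precedence over it, so clients follow the <meta> tag only when the header
names no charset.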
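
For reference, HTML 4.01 (section 5.2.2) ranks a charset parameter in the HTTP
Content-Type header above a META declaration. A rough sketch of that ranking
in plain Java follows; the class and method names are invented for
illustration, and real browsers apply further heuristics (BOM sniffing, user
overrides) that are not modelled here.

```java
import java.util.Locale;

public class CharsetPrecedence {
    // Extracts the charset parameter from a Content-Type value, or null if absent.
    static String charsetParam(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // HTTP header charset wins; the META value is used only when the header has none.
    // Returns null when neither source names a charset.
    static String effectiveCharset(String httpContentType, String metaContentType) {
        String fromHeader = charsetParam(httpContentType);
        return (fromHeader != null) ? fromHeader : charsetParam(metaContentType);
    }

    public static void main(String[] args) {
        String meta = "text/html; charset=utf-8";
        // Header as reported for TC v5.0.12: no charset, so the META value applies.
        System.out.println(effectiveCharset("text/html", meta)); // prints utf-8
        // Header as reported for TC v5.0.14: the header charset overrides META.
        System.out.println(effectiveCharset("text/html;charset=ISO-8859-1", meta)); // prints ISO-8859-1
    }
}
```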

For static content, such as HTML pages, you cannot specify a page encoding
other than the default on the fly. For dynamic content, such as a JSP, you can
do it in the JSP page directive, like this:

<%@ page
   info="A test page"
   contentType="text/html; charset=utf-8"
%>

Nix.




Re: TC 5.0.14 Breaks UTF-8 Content via HTTP Header

Posted by Bill Barker <wb...@wilshire.com>.
See inline.

----- Original Message ----- 
From: "Tony LaPaso" <tl...@comcast.net>
To: <to...@jakarta.apache.org>; <to...@jakarta.apache.org>
Sent: Monday, November 10, 2003 8:15 PM
Subject: TC 5.0.14 Breaks UTF-8 Content via HTTP Header


> Hi everyone,
>
> It seems a change to TC v5.0.14 may have broken the serving of UTF-8
> documents. Specifically, one of the HTTP headers seems wrong. I'd like to
> describe what I'm seeing TC v5.0.14 compared with v5.0.12.
>
> For both v5.0.12 and v5.0.14 I'm running TC as it comes "out of the box"
> i.e., with no changes to the default configurations.
>
> In both cases I tested with four browsers (IE 5, IE 6, Netscape 7.1 and
> Firebird 0.7), all on Win 2K.
>
>
> Here's What I Did
> -----------------
> In both versions of TC, I added an "em dash" character to the
> "/tomcat-docs/cgi-howto.html" documents that come with the TC documentation.
> The UTF-8 representation for the "em dash" character is the three bytes
> 0xE28094. I also made sure both documents had the following META tag in its
> <head>:
>
> <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>
>
> I then saved the documents as UTF-8 (without a BOM). Finally, I brought the
> document into a hex editor to check that the em dash was properly encoded as
> three bytes (which it was). This indicated to me that the document was
> indeed encoded as UTF-8.
>
>
> Here's What I Saw (TC v5.0.12)
> ------------------------------
> Under TC v5.0.12, everything looked great using all browsers -- the "em
> dash" was rendered correctly. I put a sniffer on the HTTP stream. The
> v5.0.12 Coyote Connector was sending this HTTP response header:
> Content-Type: text/html
>
>
> Here's What I Saw (TC v5.0.14)
> ------------------------------
> Under TC v5.0.14 the "em dash" character was rendered as *THREE SEPARATE
> CHARACTERs* (one for each byte). Moreover, putting a sniffer on the HTTP
> stream indicated the following response header was being sent by the v5.0.14
> Coyote Connector:
> Content-Type: text/html;charset=ISO-8859-1
>
>
> Aside
> -----
> For the heck of it I re-saved the v5.0.14 UTF-8 document with a BOM
> (0xEFBBBF). Doing this made IE correctly render it but Netscape and Firebird
> still had problems. I'm pretty sure that Unicode says the BOM is optional
> anyway.
>
>
> Conclusion (?)
> --------------
> It seems that v5.0.14 of the Coyote Connector is incorrectly sending the
> wrong response header. I'm not sure what the HTTP spec says *should* be sent
> for the header if the document's <head> contains:

The spec says nothing about META tags.  Tomcat (correctly) treats them as
just so much output text.

>
> <meta http-equiv='Content-Type' content='text/html; charset=utf-8'/>
>
> My guess is that either the response header in v5.0.14 needs to be changed
> to:
> Content-Type: text/html;charset=UTF-8
>
> or possibly:
>
> Content-Type: text/html
>
> as it was with TC v5.0.12.
>
> Can anyone comment? Is this a TC v5.0.14 bug? It would seem to be.

It looks like a 5.0.12 bug that was subsequently fixed :).  The 2.4
Servlet-spec clearly states:
<spec-quote version="Servlet-2.4-pfd3" section="14.2.22">
If no character encoding has been specified, ISO-8859-1
is returned.
</spec-quote>
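
To make the effect of that default concrete, here is a small model in plain
Java. It is NOT Tomcat's code: the class, the method, and the way the header
is assembled are assumptions made purely for illustration. It only shows how
falling back to ISO-8859-1 when no encoding was specified would yield the
header reported for 5.0.14.

```java
public class DefaultCharsetHeader {
    // Default named in the Servlet 2.4 text quoted above.
    static final String DEFAULT_ENCODING = "ISO-8859-1";

    // Hypothetical helper (not a Tomcat API): builds a Content-Type value,
    // using the default encoding when the application specified none.
    static String contentTypeHeader(String mimeType, String specifiedEncoding) {
        String encoding = (specifiedEncoding != null) ? specifiedEncoding : DEFAULT_ENCODING;
        return mimeType + ";charset=" + encoding;
    }

    public static void main(String[] args) {
        // Static HTML file, nothing specified: matches the 5.0.14 header.
        System.out.println(contentTypeHeader("text/html", null));    // prints text/html;charset=ISO-8859-1
        // Encoding set explicitly, e.g. through a JSP page directive.
        System.out.println(contentTypeHeader("text/html", "UTF-8")); // prints text/html;charset=UTF-8
    }
}
```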

>
> Thanks,
>
> Tony