Posted to infrastructure-issues@apache.org by "Konstantin Preißer (JIRA)" <ji...@apache.org> on 2013/11/07 20:19:17 UTC

[jira] [Created] (INFRA-6974) "Content-Type" header includes "charset=UTF-8" for some static file types

Konstantin Preißer created INFRA-6974:
-----------------------------------------

             Summary: "Content-Type" header includes "charset=UTF-8" for some static file types
                 Key: INFRA-6974
                 URL: https://issues.apache.org/jira/browse/INFRA-6974
             Project: Infrastructure
          Issue Type: Bug
          Components: HTTP Server
            Reporter: Konstantin Preißer
            Priority: Minor


Hi,

I noticed that the HTTP server that serves Apache project websites such as tomcat.apache.org automatically adds a "charset=UTF-8" parameter to the Content-Type header for static *.html and *.txt files, regardless of the actual encoding of the file (reference messages: [1] and [2]).

E.g., if you request http://tomcat.apache.org/ (a static HTML page), the Content-Type header will be:

Content-Type: text/html; charset=utf-8

I'm a fan of using UTF-8 for everything (especially for Web pages), and including a "charset" parameter in the Content-Type header probably saves the browser some time, as it doesn't need to determine the encoding from the file content. However, not all .html pages on Apache project websites are encoded as UTF-8, so some of them end up with conflicting encoding declarations.

E.g., for this page:
http://tomcat.apache.org/tomcat-6.0-doc/index.html
the encoding in the Content-Type header is "UTF-8", but the encoding declared in the file content is "ISO-8859-1", which is the actual encoding of the file.
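
Such a conflict can be checked mechanically. The following is only a rough sketch in Python, not a tool used by the infrastructure team: the function names are hypothetical, and the regular expression is a simplification that only looks for a "charset=" declaration inside a <meta> tag near the start of the file.

```python
import codecs
import re
from email.message import Message

# Simplified pattern: finds charset=... inside a <meta> tag (assumption: the
# declaration appears within the first 1024 bytes of the document).
META_RE = re.compile(rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)

def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def declared_charset(body):
    """Return the charset declared in a <meta> tag of the body, or None."""
    match = META_RE.search(body[:1024])
    return match.group(1).decode("ascii") if match else None

def normalize(name):
    """Map charset aliases (e.g. 'latin-1', 'ISO-8859-1') to one codec name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()

def charsets_conflict(content_type, body):
    """True if both header and body declare a charset and the two differ."""
    from_header = header_charset(content_type)
    from_body = declared_charset(body)
    if from_header is None or from_body is None:
        return False
    return normalize(from_header) != normalize(from_body)
```

For a header of "text/html; charset=utf-8" and a body containing <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, charsets_conflict() reports a conflict.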

As the encoding from the HTTP Content-Type header takes precedence, browsers will interpret the file as UTF-8 instead of ISO-8859-1. This means that if the file contains non-ASCII characters (bytes above 0x7F), a browser can display them incorrectly because it applies the wrong encoding.
For the Tomcat 6.0 docs (linked above) this has no visible effect, since they don't use non-ASCII characters directly but encode them as entity references or character references.

However, there are some pages where the conflicting encodings do have visible effects, mostly because decoding as UTF-8 fails:
1) http://commons.apache.org/proper/commons-dbcp/
2) http://commons.apache.org/proper/commons-attributes/

In the left-hand menu of page 1), there is an <h5> element with the text "Commons DBCP", but the space is actually a U+00A0 character (no-break space), encoded as the single byte 0xA0. A lone 0xA0 byte is not a valid UTF-8 sequence, so browsers fail to decode it as UTF-8 and display "�" (U+FFFD, Replacement Character) instead. If you manually change the encoding to ISO-8859-1 in the browser's menu, the page will be displayed correctly.
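
The effect can be reproduced outside a browser. A minimal sketch in Python, assuming the menu text is the byte sequence described above (the variable names are only for illustration):

```python
# "Commons DBCP" with the space stored as the single byte 0xA0 (ISO-8859-1 nbsp).
raw = b"Commons\xa0DBCP"

# Decoded with the file's actual encoding, the byte becomes U+00A0.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as UTF-8 (as the header demands), the lone 0xA0 byte is invalid;
# errors="replace" substitutes U+FFFD, much like a browser does.
as_utf8 = raw.decode("utf-8", errors="replace")

print(repr(as_latin1))  # 'Commons\xa0DBCP'
print(repr(as_utf8))    # 'Commons\ufffdDBCP'
```

A strict raw.decode("utf-8") raises UnicodeDecodeError instead of substituting a character.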

Additionally, there are some Apache sites with conflicting encodings (encoded as ISO-8859-1 but overridden with UTF-8) where this doesn't seem to have visible effects:
1) http://jclouds.apache.org/
2) http://jmeter.apache.org/
3) http://perl.apache.org/
4) http://spamassassin.apache.org/
5) http://uima.apache.org/

So, I think that a "charset=UTF-8" parameter shouldn't be appended to the Content-Type header of static resources unless it is certain that the encoding really is UTF-8, as there are still a number of static HTML pages that use ISO-8859-1 instead of UTF-8.
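
To illustrate the idea (this is not how the HTTP server is actually configured, and the function name is hypothetical), a server-side check could add the parameter only when the file's bytes are valid UTF-8. A rough sketch in Python:

```python
def content_type_for(mime_type, body):
    """Append charset=UTF-8 only if the body decodes cleanly as UTF-8.

    Heuristic only: pure-ASCII files also pass, which is harmless because
    ASCII bytes mean the same thing in UTF-8 and ISO-8859-1.
    """
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the charset out so the browser falls back
        # to the encoding declared in the file itself.
        return mime_type
    return mime_type + "; charset=UTF-8"
```

With this check, a page containing the raw byte 0xA0 would be served as plain "text/html", while a pure-ASCII or genuinely UTF-8 page would still get "text/html; charset=UTF-8".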

[1] http://markmail.org/message/ls473qxwtrcegyyo
[2] http://markmail.org/message/oe6re3xtkkwi24py


