Posted to infrastructure-issues@apache.org by "Daniel Gruno (JIRA)" <ji...@apache.org> on 2014/09/05 13:13:28 UTC

[jira] [Resolved] (INFRA-6974) "Content-Type" header includes "charset=UTF-8" for some static file types

     [ https://issues.apache.org/jira/browse/INFRA-6974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Gruno resolved INFRA-6974.
---------------------------------
    Resolution: Fixed
      Assignee: Daniel Gruno

The content encoding default has now been disabled; let's see how that goes.

> "Content-Type" header includes "charset=UTF-8" for some static file types
> -------------------------------------------------------------------------
>
>                 Key: INFRA-6974
>                 URL: https://issues.apache.org/jira/browse/INFRA-6974
>             Project: Infrastructure
>          Issue Type: Bug
>          Components: HTTP Server
>            Reporter: Konstantin Preißer
>            Assignee: Daniel Gruno
>            Priority: Minor
>              Labels: #bugbash
>
> Hi,
> I noticed that the HTTP server that serves Apache project websites such as tomcat.apache.org automatically includes a "charset=UTF-8" parameter in the Content-Type header for static *.html and *.txt files, regardless of the actual encoding of the file (reference messages: [1] and [2]).
> E.g., if you request http://tomcat.apache.org/ (static html page), then the Content-Type header will be:
> Content-Type: text/html; charset=utf-8
> Although I'm a fan of using UTF-8 for everything (especially for Web pages), and including a "charset" field in the Content-Type probably saves the browser some time as it doesn't need to find out the encoding from the file content, this means that some .html pages have conflicting encoding declarations, as not all .html pages on Apache Project websites are encoded as UTF-8.
> E.g., for this page:
> http://tomcat.apache.org/tomcat-6.0-doc/index.html
> the encoding in the Content-Type header says "UTF-8", but the encoding declared in the file content says "ISO-8859-1", which is the actual encoding of the file.
> As the encoding from the HTTP Content-Type header takes precedence, browsers will interpret the file as UTF-8 instead of ISO-8859-1. This means that if the file contains non-ASCII characters (> 0x7F), a browser will display them incorrectly because of the wrong encoding.
> For the Tomcat 6.0 docs (linked above) this has no visible effect since they don't use non-ASCII characters directly but encode them as entity references or character references.
> However, there are some pages where the conflicting encodings have effects, mostly such that decoding as UTF-8 fails:
> 1) http://commons.apache.org/proper/commons-dbcp/
> 2) http://commons.apache.org/proper/commons-attributes/
> In the left-hand menu of page 1), there is an <h5> element with the text "Commons DBCP", but the space is actually a U+00A0 character (nbsp), encoded as the single byte 0xA0. A lone 0xA0 byte is not a valid UTF-8 sequence, so browsers that decode the page as UTF-8 display "�" (U+FFFD, Replacement Character) instead. If you manually change the encoding to ISO-8859-1 in the browser's menu, the page is displayed correctly.
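
The decoding behaviour described for the "Commons DBCP" menu entry can be reproduced with a minimal Python sketch. The byte string below is a hypothetical reconstruction of the menu text, not data fetched from the site, and `is_valid_utf8` is an illustrative helper name.

```python
# Hypothetical sample: the menu text from the report, with the
# no-break space (U+00A0) stored as the single byte 0xA0.
raw = b"Commons\xa0DBCP"

# ISO-8859-1 maps every byte to the code point of the same value,
# so 0xA0 becomes U+00A0 and the text is recovered intact.
as_latin1 = raw.decode("iso-8859-1")

# In UTF-8, 0xA0 is a continuation byte and cannot appear without a
# lead byte. With errors="replace" the decoder substitutes U+FFFD,
# which matches what the report says browsers display.
as_utf8 = raw.decode("utf-8", errors="replace")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(repr(as_latin1))
print(repr(as_utf8))
print(is_valid_utf8(raw))
```

A strict-decoding check of this kind is one way a site build could confirm that a file really is UTF-8 before a "charset=UTF-8" parameter is attached to it. Note that it only detects invalid byte sequences; pure ASCII content passes under both encodings.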
> Additionally, there are some Apache sites with conflicting encodings (encoded as ISO-8859-1 but overridden with UTF-8) where this doesn't seem to have visible effects:
> 1) http://jclouds.apache.org/
> 2) http://jmeter.apache.org/
> 3) http://perl.apache.org/
> 4) http://spamassassin.apache.org/
> 5) http://uima.apache.org/
> So, I think that a "charset=UTF-8" parameter shouldn't be appended to Content-Type headers of static resources if one isn't sure that the encoding is really UTF-8, as there are still a number of static HTML pages which use ISO-8859-1 instead of UTF-8.
> [1] http://markmail.org/message/ls473qxwtrcegyyo
> [2] http://markmail.org/message/oe6re3xtkkwi24py
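
The conflict described in the report (an HTTP header that says UTF-8 while the file itself declares ISO-8859-1) could be detected with a check along the following lines. This is only a sketch: the function names, the regular expression, and the 1024-byte scan window are assumptions made for illustration, and the sample page is invented rather than copied from any Apache site.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# Assumed pattern: finds charset=... inside a <meta> tag, covering both
# <meta charset="..."> and the http-equiv="Content-Type" form.
_META_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def meta_charset(html: bytes) -> Optional[str]:
    """Extract the charset declared in the first 1024 bytes of the page."""
    match = _META_RE.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def _normalize(name: str) -> str:
    """Map aliases such as latin1 and iso-8859-1 to one canonical name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def charsets_conflict(content_type: str, html: bytes) -> bool:
    """True only if both declarations exist and name different encodings."""
    from_header = header_charset(content_type)
    from_meta = meta_charset(html)
    if from_header is None or from_meta is None:
        return False
    return _normalize(from_header) != _normalize(from_meta)


# Invented sample page that declares ISO-8859-1 in its markup.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=iso-8859-1"></head><body></body></html>'
)
print(charsets_conflict("text/html; charset=utf-8", page))
print(charsets_conflict("text/html", page))
```

With no charset parameter in the header, the check reports no conflict, which corresponds to the situation after the server-side default is disabled and the in-file declaration is left to decide.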



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)