Posted to dev@tomcat.apache.org by bu...@apache.org on 2019/12/04 17:57:12 UTC

[Bug 63985] Tomcat 9 does not read UTF-8 files with no bom correctly

https://bz.apache.org/bugzilla/show_bug.cgi?id=63985

Mark Thomas <ma...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All
             Status|NEW                         |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #1 from Mark Thomas <ma...@apache.org> ---
Thanks for the report.

The issue is not how Tomcat reads the files. The bytes presented to the user
agent are exactly the bytes that are on disk.

The issue is that the Default Servlet does not add a content-type and charset.

In the BOM case the user agent reads the BOM and renders the bytes as UTF-8.
In the non-BOM case the user agent renders the bytes as ISO-8859-1.
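
As a rough illustration of the two cases above, the following Java sketch decodes the same UTF-8 bytes the way a user agent with no charset information would, following the description in this comment: UTF-8 if a BOM is present, ISO-8859-1 otherwise. The class and method names are hypothetical, and real user agents apply more elaborate sniffing rules than this.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // The UTF-8 byte order mark is the three bytes EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Hypothetical helper: returns a copy of data with the BOM in front.
    static byte[] prependBom(byte[] data) {
        byte[] out = new byte[UTF8_BOM.length + data.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(data, 0, out, UTF8_BOM.length, data.length);
        return out;
    }

    // Simplified model of a user agent that received no charset:
    // a BOM selects UTF-8, anything else falls back to ISO-8859-1.
    static String renderLikeUserAgent(byte[] data) {
        if (startsWithUtf8Bom(data)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes C3 A9 in UTF-8.
        byte[] noBom = "\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = prependBom(noBom);

        // With the BOM the bytes decode to the single character U+00E9.
        System.out.println(renderLikeUserAgent(withBom).length());
        // Without it, C3 and A9 become two characters, U+00C3 and U+00A9.
        System.out.println(renderLikeUserAgent(noBom).length());
    }
}
```

In both cases the bytes on the wire are identical apart from the BOM; only the interpretation by the user agent differs.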

The immediate solution is to add:
<head>
<meta charset="utf-8"/>
</head>
to the HTML page with no BOM. This allows the user agent to do the right thing.

Whether Tomcat should do anything about this is debatable. This is probably a
discussion for the dev list. I'm going to resolve this as WONTFIX, but I'll try
to set out the key points here as a starter for a dev list discussion, should
one be required.

This only applies to text files served by the default servlet.

There are multiple encodings in play here:
1. The encoding the text file has been saved with.
2. The encoding declared within the file (if any).
3. The fileEncoding init param for the Default servlet.
4. The default encoding configured for the web application (if any).
5. The encoding of the resource the static resource is being included in (if
this is an include).
6. The default character encoding (ISO-8859-1) as defined by the Servlet spec.
7. Any explicit encoding declared for the request (e.g. by a filter).

The various encodings above are not always consistent. When they conflict, user
agents will generally prioritise explicit encodings in the HTTP headers, then
encodings declared in the file.

Because 3 applies to a whole web application (it is typically set per server,
but it can be set per web application) and multiple values for 1 within a
single web application are fairly common, Tomcat tries to do as little as
possible, on the assumption that the user agent will be able to figure out the
right thing to do from the file in most cases. This is why it is a good idea to
declare encodings in files where the file format supports this.

Experience to date is that it breaks more things than it fixes to have Tomcat
set an explicit encoding. That may change at some point as everyone shifts to
UTF-8 everywhere. I'm not sure we are there yet.

If all of the following are true, Tomcat will attempt to convert the bytes from
the input file:
- The requested resource is a text file
- An explicit character encoding has been set for the response
- The explicit character encoding set is not the same as fileEncoding
In this case only, Tomcat reads the bytes from the file, converts them to
characters using fileEncoding, converts those characters back to bytes using
the explicitly declared encoding and then writes those bytes to the response.
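
The conversion rule above can be sketched in Java as follows. This is a minimal illustration of the three conditions and the decode/re-encode step, not the Default Servlet's actual code; the class, method and parameter names are hypothetical, and a null responseEncoding stands in for "no explicit character encoding has been set".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {
    // Returns the bytes that would be written to the response.
    static byte[] bytesForResponse(byte[] fileBytes,
                                   boolean isTextResource,
                                   Charset fileEncoding,
                                   Charset responseEncoding) {
        boolean mustConvert = isTextResource
                && responseEncoding != null
                && !responseEncoding.equals(fileEncoding);
        if (!mustConvert) {
            // Default path: the bytes on disk are passed through untouched.
            return fileBytes;
        }
        // Bytes -> characters using fileEncoding,
        // then characters -> bytes using the response encoding.
        String chars = new String(fileBytes, fileEncoding);
        return chars.getBytes(responseEncoding);
    }

    public static void main(String[] args) {
        // U+00E9 stored as ISO-8859-1 is the single byte E9.
        byte[] onDisk = "\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] sent = bytesForResponse(onDisk, true,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // After conversion to UTF-8 the character occupies two bytes (C3 A9).
        System.out.println(onDisk.length + " -> " + sent.length);
    }
}
```

Note that the sketch trusts fileEncoding: if the file was actually saved in a different encoding (item 1 in the list of encodings), the conversion corrupts the content, which is one reason doing nothing is the safer default.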

All of this is sufficiently complex that we have over 3,000 unit tests checking
various combinations.

Given the above, another solution would be to use the AddDefaultCharsetFilter
to explicitly set UTF-8 as the charset for all *.html files.
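
To make the effect of such a filter concrete, here is a small Java sketch of the kind of rule it applies: append a charset to a text content type that does not already declare one. This is an illustration only and not the source of AddDefaultCharsetFilter; the class and method names are hypothetical, and restricting the rule to *.html would be done through the filter mapping rather than in this logic.

```java
import java.util.Locale;

public class DefaultCharsetSketch {
    // Returns the Content-Type value with a charset appended when the
    // type is text/* and no charset parameter is present yet.
    static String withDefaultCharset(String contentType, String charset) {
        if (contentType == null) {
            return null;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        if (lower.startsWith("text/") && !lower.contains("charset=")) {
            return contentType + ";charset=" + charset;
        }
        // Non-text types and types with an explicit charset are unchanged.
        return contentType;
    }

    public static void main(String[] args) {
        System.out.println(withDefaultCharset("text/html", "UTF-8"));
        System.out.println(withDefaultCharset("text/html;charset=ISO-8859-1", "UTF-8"));
        System.out.println(withDefaultCharset("image/png", "UTF-8"));
    }
}
```

Once a charset is set this way, the response has an explicit character encoding, so the conversion conditions described earlier come into play if that charset differs from fileEncoding.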

There might be a case to add an option to the default servlet to add an
explicit encoding to text responses that don't have one. It would probably need
to allow for:
- same as fileEncoding
- same as web application default response encoding
- explicit charset
But I do wonder how much stuff that would actually break rather than fix.
