Posted to users@tomcat.apache.org by Carsten Klein <c....@datagis.com> on 2021/05/28 07:14:03 UTC

Encoding of LocalStrings_xy.properties files

Hi there,

I'm facing character set encoding problems in a fairly recent Tomcat 10 
setup. I noticed them in the Manager application 
(http://localhost:8080/manager/html) with my browser set to German.

My Tomcat runs from within Eclipse, built with the official build.xml 
file. I'm using my forked cklein05/tomcat GitHub repository, which is 
nearly up to date with your main branch.

In the Manager application, there are texts which contain German 
umlauts, like "Lösche Sitzungen" (Expire sessions, aka 
htmlManagerServlet.appsExpire).

These buttons now have captions that look like "LÃ¶sche Sitzungen". 
Obviously that's a UTF-8 <-> ISO-xxxx-y conversion issue.
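
The garbling can be reproduced in isolation: text encoded as UTF-8 and then decoded as ISO-8859-1 turns every non-ASCII character into two or more Latin-1 characters. A minimal sketch, assuming a hypothetical demo class that is not part of Tomcat:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only; not Tomcat code.
final class MojibakeDemo {

    // Encode the text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The o-umlaut (U+00F6) becomes the two bytes C3 B6 in UTF-8,
        // which ISO-8859-1 decodes as two separate characters.
        System.out.println(utf8ReadAsLatin1("L\u00f6sche Sitzungen"));
    }
}
```

The string literal uses a Unicode escape so that the example does not depend on the encoding of the Java source file itself.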

I'm pretty sure that my setup is not causing these problems. After 
digging into GitHub, I found that recently someone converted many (or 
all) messages files to UTF-8:

https://github.com/apache/tomcat/commit/90fe08bdee0494110bb8145d2f067b61f74ae429

However, since these language files are actually java.util.Properties 
files, these must be encoded as ISO-8859-1:

https://docs.oracle.com/javase/8/docs/api/java/util/Properties.html#load-java.io.InputStream-

That's also true for more recent versions of Java.

The language files are actually Properties files in a (according to 
Javadoc) "simple line-oriented format". These must be loaded with the 
Properties.load method(s) and must always be in ISO-8859-1. In contrast, 
there are XML-based Properties files, that must be loaded with method(s) 
loadFromXML(...). Only these must be encoded in UTF-8.
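
For reference, the ISO-8859-1 rule applies to Properties.load(InputStream). The load(Reader) overload, available since Java 6, decodes with whatever charset the Reader was created with. The sketch below contrasts the two on the same UTF-8 bytes. The class name and the in-memory "file" are assumptions made for illustration; only the key and the German text come from the Manager application.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class for illustration only; not Tomcat code.
final class PropertiesEncodingDemo {

    static final String KEY = "htmlManagerServlet.appsExpire";

    // Stand-in for a properties file that was saved as UTF-8.
    static final byte[] UTF8_FILE =
            (KEY + "=L\u00f6sche Sitzungen\n").getBytes(StandardCharsets.UTF_8);

    // load(InputStream) always decodes the bytes as ISO-8859-1.
    static String loadViaInputStream() {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(UTF8_FILE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(KEY);
    }

    // load(Reader) uses the charset chosen when the Reader was built.
    static String loadViaReader() {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(UTF8_FILE), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(KEY);
    }

    public static void main(String[] args) {
        System.out.println(loadViaInputStream()); // garbled umlaut
        System.out.println(loadViaReader());      // intact umlaut
    }
}
```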

Although editing international language files in ISO-8859-1 requires 
many \uXXXX escapes and is a hassle, to my mind, converting these 
plain-text language files to UTF-8 was likely not a good idea.

But why don't others report that problem? Am I overlooking something?

According to my explanation above, that problem is neither limited to 
German language nor to the Manager application. It should occur with any 
language using non-ASCII characters (> 127) and with all localized text 
resources Tomcat is using.

Carsten


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Encoding of LocalStrings_xy.properties files

Posted by Carsten Klein <c....@datagis.com>.
Mark,

On 01/06/2021 09:15, Mark Thomas wrote:

</snip>

> Start Tomcat with:
> catalina jpda run
> (or start but I typically use run as I nearly always want to see what is 
> logged to the console)
> 
> In Eclipse go to Debug > Debug Configurations > Remote Java Application 
>  > New Configuration. Browse to the project and then click Debug. 
> Tomcat's default jpda config matches Eclipse's so you should then have a 
> remote debug session set up with your Tomcat instance.

Trying that soon. Many thanks.

Carsten



Re: Encoding of LocalStrings_xy.properties files

Posted by Mark Thomas <ma...@apache.org>.
On 28/05/2021 10:13, Carsten Klein wrote:
> 
> Mark,
> 
> On 28/05/2021 10:35, Mark Thomas wrote:
> 
> </not quoting anything>
> 
> No doubt that UTF-8 is the better encoding for messages and language 
> files. And yes, my Eclipse actually does not use the version built by 
> Ant. I use the start-tomcat.launch configuration file for starting 
> Tomcat. Actually it only takes a startup-class name. So, it must 
> obviously use the JARs built by Eclipse.
> 
> The trick is, that in the build.xml file, you are actually converting 
> message files:
> 
> <!-- Convert the message files from UTF-8 to ASCII. This can be removed
> after upgrading to Java 9+ as the minimum JRE and specifying the 
> encoding when loading the ResourceBundles -->
> 
> Simple. However, you do that after having them copied. While copying, 
> you use filtering-copy and specify ISO-8859-1 as the file's encoding:
> 
> <!-- Copy static resource files -->
> <copy todir="${tomcat.classes}" encoding="ISO-8859-1">
>    <filterset refid="version.filters"/>
>    <fileset dir="java">
>      <include name="**/*.properties"/>
>      <exclude name="**/LocalStrings*.properties"/>
>      [...]
> 
> Should be UTF-8 now?

Strictly, yes. Practically, it makes no difference because the filters 
that are applied do find-and-replace on ASCII strings, and those strings 
are highly unlikely to ever be anything other than ASCII.

I'll get that updated.

> Back to Eclipse. I guess there is not much difference between 
> calling Ant from the console and using Eclipse's Ant support (Run As -> 
> Ant build). But, how to start that with support for debugging in Eclipse 
> (maybe a dumb question, I know)?

Start Tomcat with:
catalina jpda run
(or start but I typically use run as I nearly always want to see what is 
logged to the console)

In Eclipse go to Debug > Debug Configurations > Remote Java Application 
 > New Configuration. Browse to the project and then click Debug. 
Tomcat's default jpda config matches Eclipse's so you should then have a 
remote debug session set up with your Tomcat instance.

Mark

> 
> Carsten
> 




Re: Encoding of LocalStrings_xy.properties files

Posted by Carsten Klein <c....@datagis.com>.
Mark,

On 28/05/2021 10:35, Mark Thomas wrote:

</not quoting anything>

No doubt that UTF-8 is the better encoding for messages and language 
files. And yes, my Eclipse actually does not use the version built by 
Ant. I use the start-tomcat.launch configuration file for starting 
Tomcat. Actually it only takes a startup-class name. So, it must 
obviously use the JARs built by Eclipse.

The trick is, that in the build.xml file, you are actually converting 
message files:

<!-- Convert the message files from UTF-8 to ASCII. This can be removed
after upgrading to Java 9+ as the minimum JRE and specifying the 
encoding when loading the ResourceBundles -->

Simple. However, you do that after having them copied. While copying, 
you use filtering-copy and specify ISO-8859-1 as the file's encoding:

<!-- Copy static resource files -->
<copy todir="${tomcat.classes}" encoding="ISO-8859-1">
   <filterset refid="version.filters"/>
   <fileset dir="java">
     <include name="**/*.properties"/>
     <exclude name="**/LocalStrings*.properties"/>
     [...]

Should be UTF-8 now?

Back to Eclipse. I guess there is not much difference between 
calling Ant from the console and using Eclipse's Ant support (Run As -> 
Ant build). But, how to start that with support for debugging in Eclipse 
(maybe a dumb question, I know)?

Carsten



Re: Encoding of LocalStrings_xy.properties files

Posted by Mark Thomas <ma...@apache.org>.
On 28/05/2021 08:14, Carsten Klein wrote:
> Hi there,
> 
> I'm facing character set encoding problems in a fairly recent Tomcat 10 
> setup. I noticed them in the Manager application 
> (http://localhost:8080/manager/html) with my browser set to German.
> 
> My Tomcat runs from within Eclipse, built with the official build.xml 
> file.

I suspect that that is not actually the case and that Eclipse is running 
from its own copy of the source and compiled classes.

> I'm using my forked cklein05/tomcat GitHub repository, which is 
> nearly up to date with your main branch.
> 
> In the Manager application, there are texts which contain German 
> umlauts, like "Lösche Sitzungen" (Expire sessions, aka 
> htmlManagerServlet.appsExpire).
> 
> These buttons now have captions that look like "LÃ¶sche Sitzungen". 
> Obviously that's a UTF-8 <-> ISO-xxxx-y conversion issue.
> 
> I'm pretty sure that my setup is not causing these problems.

Yes, it is.

> After 
> digging into GitHub, I found that recently someone converted many (or 
> all) messages files to UTF-8:
> 
> https://github.com/apache/tomcat/commit/90fe08bdee0494110bb8145d2f067b61f74ae429 
> 
> 
> However, since these language files are actually java.util.Properties 
> files,

Not quite. They are java.util.ResourceBundle files.

> these must be encoded as ISO-8859-1:
> 
> https://docs.oracle.com/javase/8/docs/api/java/util/Properties.html#load-java.io.InputStream- 
> 
> That's also true for more recent versions of Java.

Not for ResourceBundle. As of Java 9, an encoding can be specified. As 
soon as the minimum required version of Java is >=9, we'll switch to 
that method of loading.
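
One way to read a UTF-8 bundle with an explicit encoding is the PropertyResourceBundle(Reader) constructor. The following is only a sketch of that idea with a hypothetical class name and an in-memory file; it is not necessarily the loading method Tomcat will adopt.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical demo class for illustration only; not Tomcat code.
final class Utf8BundleDemo {

    // Build a bundle from UTF-8 encoded properties content and look up one key.
    static String lookup(byte[] utf8Properties, String key) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Properties), StandardCharsets.UTF_8)) {
            // The Reader, not the bundle, decides how bytes become characters.
            ResourceBundle bundle = new PropertyResourceBundle(reader);
            return bundle.getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = "htmlManagerServlet.appsExpire=L\u00f6sche Sitzungen\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(lookup(file, "htmlManagerServlet.appsExpire"));
    }
}
```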

> The language files are actually Properties files in a (according to 
> Javadoc) "simple line-oriented format". These must be loaded with the 
> Properties.load method(s) and must always be in ISO-8859-1. In contrast, 
> there are XML-based Properties files, that must be loaded with method(s) 
> loadFromXML(...). Only these must be encoded in UTF-8.
> 
> Although editing international language files in ISO-8859-1 requires 
> many \uXXXX escapes and is a hassle, to my mind, converting these 
> plain-text language files to UTF-8 was likely not a good idea.

The Tomcat maintainers disagree. Using UTF-8 makes maintenance 
significantly simpler and allowed integration with poeditor.com that has 
enabled 175 contributors (at today's count) to contribute new and 
improved translations including complete translations in Chinese and Korean.

One thing you do need to be aware of is the use of MessageFormat. Any 
string that contains {n} will be passed through MessageFormat so any 
single quote characters in the string need to be escaped with a second 
single quote. Apart from a few special cases, any instance of {n} is 
surrounded by [] to give [{n}] so that replaced values are clearly 
delimited. This is to help with issues around empty values and 
leading/trailing spaces that are otherwise not immediately obvious in 
the logs.
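
To illustrate the quoting rule, here is a small sketch using java.text.MessageFormat directly. The message texts and the class name are made up for the example and are not taken from Tomcat's LocalStrings files.

```java
import java.text.MessageFormat;

// Hypothetical demo class for illustration only; not Tomcat code.
final class MessageFormatQuoteDemo {

    // Thin wrapper so the examples below read clearly.
    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // A single quote opens a quoted section: the quote is dropped and
        // the placeholder after it is treated as literal text.
        System.out.println(format("Can't find context [{0}]", "/app"));

        // Doubling the quote yields one literal quote and keeps {0} active.
        System.out.println(format("Can''t find context [{0}]", "/app"));

        // The brackets make an empty value visible in the output.
        System.out.println(format("Can''t find context [{0}]", ""));
    }
}
```

The three calls print "Cant find context [{0}]", "Can't find context [/app]" and "Can't find context []" respectively.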

> But why don't others report that problem?

A few people have. It has always been when running from the source 
within an IDE.

> Am I overlooking something?

https://github.com/apache/tomcat/blob/main/build.xml#L998

> According to my explanation above, that problem is neither limited to 
> German language nor to the Manager application. It should occur with any 
> language using non-ASCII characters (> 127) and with all localized text 
> resources Tomcat is using.

The issue is going to be some variation of Eclipse loading the 
ResourceBundle instances from the original source files rather than from 
the transformed versions created by the build process.
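
In effect, the transformation replaces each non-ASCII character with a \uXXXX escape, so the resulting file reads the same whether it is decoded as ISO-8859-1 or UTF-8. The build itself does this with Ant; the Java below is only a hand-written sketch of the same idea, with a hypothetical class and method name.

```java
// Hypothetical demo class for illustration only; not Tomcat code and
// not the Ant task used by the build.
final class AsciiEscapeDemo {

    // Replace every UTF-16 code unit above 127 with a backslash-u escape
    // of four hex digits; ASCII characters are copied unchanged.
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiEscapes("L\u00f6sche Sitzungen"));
    }
}
```

A class loaded straight from the source tree never sees this escaped form, which is consistent with the garbled captions appearing only when running inside the IDE.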

Not strictly relevant here but while Eclipse is my IDE of choice, I have 
always built Tomcat from the command line and used remote debugging if I 
need to step through the code. My (admittedly quite dated) experience 
with the various plug-ins that can be used to run Tomcat inside Eclipse has 
never been good. The problems were usually around picking up updates to 
code and/or figuring out where configuration files were being read from.

Mark
