Posted to dev@tomcat.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2020/02/10 20:58:58 UTC

UTF-8 properties files and BOMs

All,

I've recently begun making a change to my application's resource
bundles, converting them into UTF-8 for readability and converting
them to ISO-8859-1 during my build process to make ResourceBundle happy.

I have everything working, except that Eclipse still thinks that my
files ought to be ISO-8859-1 and ruins them when I load them.
Sometimes, it's very obvious and that's not a problem: a developer
will see that and fix it before continuing. But some files are only
*slightly* broken by this and someone might make a mistake.

NOTE: We don't keep Eclipse settings in revision-control, so I can't
modify everyone's Eclipse configuration. We are using svn and
svn:mime-type is correctly set for these files; Eclipse just ignores
that.

Anyway, I found that adding a UTF-8 BOM to the beginning of the file
fixes that issue and Eclipse does the right thing.
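
For illustration, adding the marker amounts to prepending the three bytes EF BB BF (the UTF-8 encoding of U+FEFF) to the file. The following is only a minimal sketch of that step; BomWriter, hasBom and withBom are hypothetical names, not taken from the build described here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BomWriter {
    // The UTF-8 encoding of U+FEFF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the data starts with the three UTF-8 BOM bytes. */
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Returns the same array if it already has a BOM, else a copy with one prepended. */
    static byte[] withBom(byte[] data) {
        if (hasBom(data)) {
            return data;
        }
        byte[] out = new byte[data.length + UTF8_BOM.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(data, 0, out, UTF8_BOM.length, data.length);
        return out;
    }

    /** Rewrites every file named on the command line in place (Java 8 compatible). */
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path path = Paths.get(name);
            Files.write(path, withBom(Files.readAllBytes(path)));
        }
    }
}
```

Because withBom returns its input unchanged when a BOM is already present, running the sketch twice over the same file does not stack markers.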

As a sanity check, I looked at how Tomcat's files are laid out and I
don't see any BOMs.

Should we add BOMs? Is there any reason NOT to use a BOM? These are
file types that are officially supposed to be ISO-8859-1 but everyone
wants to handle them differently, so I think adding BOMs might be a
good idea so that editors are always informed of exactly what's
happening.

WDYT?

-chris

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org

Re: UTF-8 properties files and BOMs

Posted by Christopher Schultz <ch...@christopherschultz.net>.

Mark,

On 2/11/20 9:47 AM, Mark Thomas wrote:
> On 11/02/2020 14:26, Christopher Schultz wrote:
> 

>> This appears to be a bug in (at least old versions of) Java
>> and/or native2ascii. I've got local installations of Java 8, 11
>> (Adopt), 11 (Oracle), and 13 (OpenJDK), and only Java 8 has a
>> "native2ascii" binary present. I see ant's <native2ascii> task
>> has its own implementation, but it's probably very simple, just
>> like the native2ascii program itself. Java's Reader classes
>> incorrectly interpret the BOM as an actual character instead of
>> an ignorable UTF-8 control sequence.
> 
> But the chances of us being able to "fix" the Ant implementation
> are considerably higher :).

Fair enough. ant handles this completely, so I'm happy to file a bug
against it. It would, unfortunately, cause an incompatibility between
<native2ascii> and Java's (legacy) native2ascii program. The ant team
might reject the request. I guess that's no worse than the current
situation :)
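
A rough sketch of what a BOM-aware conversion could look like follows. It is not Ant's <native2ascii> implementation; Native2AsciiSketch and toAscii are hypothetical names, and the only behaviour assumed is "drop one leading U+FEFF, then escape everything outside ASCII".

```java
final class Native2AsciiSketch {
    /**
     * Escapes every char outside ASCII as a backslash, 'u' and four hex digits,
     * after dropping a single leading U+FEFF if one is present.
     */
    static String toAscii(String text) {
        int start = (!text.isEmpty() && text.charAt(0) == '\uFEFF') ? 1 : 0;
        StringBuilder out = new StringBuilder(text.length());
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // Lower-case hex, matching the style of the output quoted in this thread.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input: a BOM, then a key whose value ends in a non-ASCII letter.
        // The printed line has no BOM and the last letter is escaped.
        System.out.println(toAscii("\uFEFFfirst.property=caf\u00e9"));
    }
}
```

Only a BOM at the very start of the text is dropped; a U+FEFF anywhere else is escaped like any other non-ASCII character.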

>> Ensuring that the first line of the file is a comment or a blank
>> line fixes things:
>> 
>> # BOM
>> first.property=foo
>> second.property=bar
>> 
>> becomes:
>> 
>> \ufeff# BOM
>> first.property=foo
>> second.property=bar
> 
> Does the BOM end up creating an additional property in this case?

Probably, but who cares? Code is unlikely to do:

bundle.getProperty([UTF-8 BOM])

And get confused by what comes out.
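
For what it's worth, the question can be checked directly with java.util.Properties. Once the line begins with an escaped U+FEFF it no longer starts with '#', so it is parsed as a key/value pair rather than a comment. A small sketch (BomPropertyDemo is a hypothetical name):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class BomPropertyDemo {
    /** Parses the text with the standard java.util.Properties loader. */
    static Properties load(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // The first line starts with the six characters backslash, u, f, e, f, f.
        Properties p = load("\\ufeff# BOM\nfirst.property=foo\nsecond.property=bar\n");
        System.out.println(p.size());                        // 3
        System.out.println(p.getProperty("\uFEFF#"));        // BOM
        System.out.println(p.getProperty("first.property")); // foo
    }
}
```

The extra entry has the key U+FEFF followed by '#' and the value "BOM"; the real properties are untouched, which is why the comment line works as a guard.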

>>> Overall, I guess I am -0 on adding BOMs.
>> 
>> Okay. This is a fairly recent change to Tomcat, and frankly, we
>> (a) don't get a huge number of outside contributions which
>> include changes to the localized properties files (except for the
>> translation-only contributions, which have been great!) and (b)
>> often ignore the non-English translations in the first place
>> because we are lazy.
>> 
>> I think maybe this can stay on the back-burner until we see if we
>> end up with any problems.
> 
> Sounds reasonable to me. It looks like we have options if we need
> them but with a few minor issues to research / iron out first if we
> go that way.
> 
>> Does/can "checkstyle" check for valid UTF-8 byte sequences in 
>> .properties files? I think that may be a helpful check to add if
>> it's not already in there.
> 
> Don't know. +1 if such a thing exists.

I know nothing about Checkstyle, so I'll defer to anyone who does know
how to configure it to do these things :)

-chris

Re: UTF-8 properties files and BOMs

Posted by Mark Thomas <ma...@apache.org>.

On 11/02/2020 14:26, Christopher Schultz wrote:

<snip/>

>> The thing that bugged me was having to manually switch properties
>> files to UTF-8 to view them "properly". Your mail motivated me to
>> track down where I can change that in Eclipse:
> 
>> Window->Preferences->General->Content Types
> 
>> and I have changed Java properties files to use UTF-8. So that is
>> my personal niggle fixed. Thanks for the motivation.
> 
> Yes, this *will* fix things, but:
> 
> 1. It's a global setting, so it can't be set on a per-project basis.
> That means you have to be willing to convert ALL your properties files
> across ALL your projects to UTF-8. That may be okay for some people,
> but not all.

Fair point.

> 2. This is a guess: Tomcat's ide-eclipse ant target can't set that
> setting for the Tomcat project(s) because it's a global setting.
> Therefore, anyone using Eclipse as an IDE will have to manually set
> their content-type in order to NOT damage any of the files we ship.

I'm not sure about actual damage. I've seen Eclipse manipulate UTF-8
files while configured to use ISO-8859-1 without issue. But maybe that
is actually git doing UTF-8 manipulation.

>> I was concerned that adding a BOM would cause problems when
>> reading property files. I've seen reports of that with Java in the
>> past. A quick test suggests that the issue is no longer present
>> with latest Java 8.
> 
> I actually had another problem after I implemented all of this: any
> property file without a blank and/or comment line at the top ended up
> with a mangled and unusable *first* property key. A file like this:
> 
> first.property=foo
> second.property=bar
> 
> would end up like this after a trip through "native2ascii -encoding
> UTF-8":
> 
> \ufefffirst.property=foo
> second.property=bar

That is similar to the problems I recall with earlier versions of Java.

> native2ascii stupidly interprets the UTF-8 BOM as an actual character,
> and encodes it in the output.
> 
> This appears to be a bug in (at least old versions of) Java and/or
> native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
> (Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
> binary present. I see ant's <native2ascii> task has its own
> implementation, but it's probably very simple, just like the
> native2ascii program itself. Java's Reader classes incorrectly
> interpret the BOM as an actual character instead of an ignorable UTF-8
> control sequence.

But the chances of us being able to "fix" the Ant implementation are
considerably higher :).

> Ensuring that the first line of the file is a comment or a blank line
> fixes things:
> 
> # BOM
> first.property=foo
> second.property=bar
> 
> becomes:
> 
> \ufeff# BOM
> first.property=foo
> second.property=bar

Does the BOM end up creating an additional property in this case?

>> Overall, I guess I am -0 on adding BOMs.
> 
> Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
> don't get a huge number of outside contributions which include changes
> to the localized properties files (except for the translation-only
> contributions, which have been great!) and (b) often ignore the
> non-English translations in the first place because we are lazy.
> 
> I think maybe this can stay on the back-burner until we see if we end
> up with any problems.

Sounds reasonable to me. It looks like we have options if we need them
but with a few minor issues to research / iron out first if we go that way.

> Does/can "checkstyle" check for valid UTF-8 byte sequences in
> .properties files? I think that may be a helpful check to add if it's
> not already in there.

Don't know. +1 if such a thing exists.

Mark

Re: UTF-8 properties files and BOMs

Posted by Martin Grigorov <mg...@apache.org>.

On Tue, Feb 11, 2020 at 4:27 PM Christopher Schultz <
chris@christopherschultz.net> wrote:

> On 2/11/20 2:37 AM, Martin Grigorov wrote:
> > I guess you use Java 8. Newer versions of Java try UTF-8 first and
> > then fall back to ISO-8859-1:
> > https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html
>
> Correct, I am using Java 8:
>
> $ java -version
> openjdk version "1.8.0_232"
> OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
> OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
>
> This is the version that Debian 9 provides. I could install a higher
> patch-version but would it help?
>
> On 2/11/20 6:38 AM, Mark Thomas wrote:
> > On 10/02/2020 20:58, Christopher Schultz wrote:
> >> All,
> >>
> >> I've recently begun making a change to my application's resource
> >> bundles, converting them into UTF-8 for readability and
> >> converting them to ISO-8859-1 during my build process to make
> >> ResourceBundle happy.
> >>
> >> I have everything working, except that Eclipse still thinks that
> >> my files ought to be ISO-8859-1 and ruins them when I load them.
> >> Sometimes, it's very obvious and that's not a problem: a
> >> developer will see that and fix it before continuing. But some
> >> files are only *slightly* broken by this and someone might make a
> >> mistake.
> >
> > I don't think we have seen this with Tomcat. Or have we (since we
> > switched to UTF-8)?
> >
> > The thing that bugged me was having to manually switch properties
> > files to UTF-8 to view them "properly". Your mail motivated me to
> > track down where I can change that in Eclipse:
> >
> > Window->Preferences->General->Content Types
> >
> > and I have changed Java properties files to use UTF-8. So that is
> > my personal niggle fixed. Thanks for the motivation.
>
> Yes, this *will* fix things, but:
>
> 1. It's a global setting, so it can't be set on a per-project basis.
> That means you have to be willing to convert ALL your properties files
> across ALL your projects to UTF-8. That may be okay for some people,
> but not all.
>
> 2. This is a guess: Tomcat's ide-eclipse ant target can't set that
> setting for the Tomcat project(s) because it's a global setting.
> Therefore, anyone using Eclipse as an IDE will have to manually set
> their content-type in order to NOT damage any of the files we ship.
>
> >> NOTE: We don't keep Eclipse settings in revision-control, so I
> >> can't modify everyone's Eclipse configuration. We are using svn
> >> and svn:mime-type is correctly set for these files; Eclipse just
> >> ignores that.
> >
> > I've seen that too. While I found it rather annoying, it wasn't
> > annoying enough to try and find a fix as that looked like it would
> > require patching Eclipse and/or the svn plug-in.
> >
> >> Anyway, I found that adding a UTF-8 BOM to the beginning of the
> >> file fixes that issue and Eclipse does the right thing.
> >
> > Ah. So Eclipse *is* doing content scanning. Interesting.
>
> Well, it's not really *content* scanning. But a BOM is the official
> way to tell the difference between a UTF-8 encoded file and one that
> just happens to have a whole bunch of valid UTF-8 byte sequences
> through (most of) the file.
>
> >> As a sanity check, I looked at how Tomcat's files are laid out
> >> and I don't see any BOMs.
> >
> > Correct. The only files in the code base that should have BOMs at
> > the moment are the ones in the test web application (under
> > bug49nnn) for testing the default Servlet's handling of files with
> > BOMs.
> >
> >> Should we add BOMs? Is there any reason NOT to use a BOM? These
> >> are file types that are officially supposed to be ISO-8859-1 but
> >> everyone wants to handle them differently, so I think adding BOMs
> >> might be a good idea so that editors are always informed of
> >> exactly what's happening.
> >>
> >> WDYT?
> >
> > I was concerned that adding a BOM would cause problems when
> > reading property files. I've seen reports of that with Java in the
> > past. A quick test suggests that the issue is no longer present
> > with latest Java 8.
>
> I actually had another problem after I implemented all of this: any
> property file without a blank and/or comment line at the top ended up
> with a mangled and unusable *first* property key. A file like this:
>
> first.property=foo
> second.property=bar
>
> would end up like this after a trip through "native2ascii -encoding
> UTF-8":
>
> \ufefffirst.property=foo
> second.property=bar
>
> native2ascii stupidly interprets the UTF-8 BOM as an actual character,
> and encodes it in the output.
>
> This appears to be a bug in (at least old versions of) Java and/or
> native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
> (Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
> binary present. I see ant's <native2ascii> task has its own
> implementation, but it's probably very simple, just like the
> native2ascii program itself. Java's Reader classes incorrectly
> interpret the BOM as an actual character instead of an ignorable UTF-8
> control sequence.
>
> I can confirm that Java 13 still seems to have this problem: running
> ant's <native2ascii> under Java 13 still corrupts the first line of
> the file.
>
> Ensuring that the first line of the file is a comment or a blank line
> fixes things:
>
> # BOM
> first.property=foo
> second.property=bar
>
> becomes:
>
> \ufeff# BOM
> first.property=foo
> second.property=bar
>
> > With the use of POEditor and the import/export scripts we have, it
> > would be unusual for someone to be editing any of the property
> > files where UTF-8 vs ISO-8859-1 matters. Thinking about it a little
> > more, there would be a need to do this to edit non-English strings
> > in the older branches where the key doesn't exist in the latest
> > code. That strikes me as a fairly rare use case.
> >
> > My other worry is that some editors will fail to handle the BOM
> > correctly and we'll end up causing more issues than we solve. I've
> > little basis for that worry other than (possibly out of date)
> > experience.
> >
> > Overall, I guess I am -0 on adding BOMs.
>
> Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
> don't get a huge number of outside contributions which include changes
> to the localized properties files (except for the translation-only
> contributions, which have been great!) and (b) often ignore the
> non-English translations in the first place because we are lazy.
>
> I think maybe this can stay on the back-burner until we see if we end
> up with any problems.
>
> Does/can "checkstyle" check for valid UTF-8 byte sequences in
> .properties files? I think that may be a helpful check to add if it's
> not already in there.
>

Just to add: I am a happy user of XML based properties files (since Java
1.5).
It is relatively simple to roll out XmlPropertyResourceBundle, e.g.
https://gist.github.com/asicfr/1b76ea60029264d7be15d019a866e1a4
This should solve your issues.
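
As a sketch of the XML route using only java.util.Properties: this is not the XmlPropertyResourceBundle from the gist above, and XmlPropertiesDemo, toXml and fromXml are hypothetical names. The encoding is declared inside the XML document itself, so the reader does not have to guess it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

final class XmlPropertiesDemo {
    /** Serialises the properties as XML with an explicit UTF-8 declaration. */
    static byte[] toXml(Properties props) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        props.storeToXML(out, null, "UTF-8");
        return out.toByteArray();
    }

    /** Reads properties back from the XML form produced by storeToXML. */
    static Properties fromXml(byte[] xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(xml));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties original = new Properties();
        // Non-ASCII value written with escapes so this source file stays ASCII.
        original.setProperty("greeting", "Gr\u00fc\u00df Gott");
        Properties copy = fromXml(toXml(original));
        System.out.println(original.equals(copy)); // true
    }
}
```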


>
> -chris

Re: UTF-8 properties files and BOMs

Posted by Christopher Schultz <ch...@christopherschultz.net>.

On 2/11/20 2:37 AM, Martin Grigorov wrote:
> I guess you use Java 8. Newer versions of Java try UTF-8 first and
> then fall back to ISO-8859-1:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html

Correct, I am using Java 8:

$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

This is the version that Debian 9 provides. I could install a higher
patch-version but would it help?

On 2/11/20 6:38 AM, Mark Thomas wrote:
> On 10/02/2020 20:58, Christopher Schultz wrote:
>> All,
>> 
>> I've recently begun making a change to my application's resource 
>> bundles, converting them into UTF-8 for readability and
>> converting them to ISO-8859-1 during my build process to make
>> ResourceBundle happy.
>> 
>> I have everything working, except that Eclipse still thinks that
>> my files ought to be ISO-8859-1 and ruins them when I load them. 
>> Sometimes, it's very obvious and that's not a problem: a
>> developer will see that and fix it before continuing. But some
>> files are only *slightly* broken by this and someone might make a
>> mistake.
> 
> I don't think we have seen this with Tomcat. Or have we (since we 
> switched to UTF-8)?
> 
> The thing that bugged me was having to manually switch properties
> files to UTF-8 to view them "properly". Your mail motivated me to
> track down where I can change that in Eclipse:
> 
> Window->Preferences->General->Content Types
> 
> and I have changed Java properties files to use UTF-8. So that is
> my personal niggle fixed. Thanks for the motivation.

Yes, this *will* fix things, but:

1. It's a global setting, so it can't be set on a per-project basis.
That means you have to be willing to convert ALL your properties files
across ALL your projects to UTF-8. That may be okay for some people,
but not all.

2. This is a guess: Tomcat's ide-eclipse ant target can't set that
setting for the Tomcat project(s) because it's a global setting.
Therefore, anyone using Eclipse as an IDE will have to manually set
their content-type in order to NOT damage any of the files we ship.

>> NOTE: We don't keep Eclipse settings in revision-control, so I
>> can't modify everyone's Eclipse configuration. We are using svn
>> and svn:mime-type is correctly set for these files; Eclipse just
>> ignores that.
> 
> I've seen that too. While I found it rather annoying, it wasn't
> annoying enough to try and find a fix as that looked like it would
> require patching Eclipse and/or the svn plug-in.
> 
>> Anyway, I found that adding a UTF-8 BOM to the beginning of the
>> file fixes that issue and Eclipse does the right thing.
> 
> Ah. So Eclipse *is* doing content scanning. Interesting.

Well, it's not really *content* scanning. But a BOM is the official
way to tell the difference between a UTF-8 encoded file and one that
just happens to have a whole bunch of valid UTF-8 byte sequences
through (most of) the file.

>> As a sanity check, I looked at how Tomcat's files are laid out
>> and I don't see any BOMs.
> 
> Correct. The only files in the code base that should have BOMs at
> the moment are the ones in the test web application (under
> bug49nnn) for testing the default Servlet's handling of files with
> BOMs.
> 
>> Should we add BOMs? Is there any reason NOT to use a BOM? These
>> are file types that are officially supposed to be ISO-8859-1 but
>> everyone wants to handle them differently, so I think adding BOMs
>> might be a good idea so that editors are always informed of
>> exactly what's happening.
>> 
>> WDYT?
> 
> I was concerned that adding a BOM would cause problems when
> reading property files. I've seen reports of that with Java in the
> past. A quick test suggests that the issue is no longer present
> with latest Java 8.

I actually had another problem after I implemented all of this: any
property file without a blank and/or comment line at the top ended up
with a mangled and unusable *first* property key. A file like this:

first.property=foo
second.property=bar

would end up like this after a trip through "native2ascii -encoding
UTF-8":

\ufefffirst.property=foo
second.property=bar

native2ascii stupidly interprets the UTF-8 BOM as an actual character,
and encodes it in the output.

This appears to be a bug in (at least old versions of) Java and/or
native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
(Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
binary present. I see ant's <native2ascii> task has its own
implementation, but it's probably very simple, just like the
native2ascii program itself. Java's Reader classes incorrectly
interpret the BOM as an actual character instead of an ignorable UTF-8
control sequence.

I can confirm that Java 13 still seems to have this problem: running
ant's <native2ascii> under Java 13 still corrupts the first line of
the file.
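
The Reader behaviour can be reproduced without native2ascii. The sketch below (BomReaderDemo, utf8Reader and skipBom are hypothetical names) decodes a BOM-prefixed byte array with InputStreamReader, and shows one possible workaround: skipping a single leading U+FEFF through a PushbackReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class BomReaderDemo {
    /** Decodes the bytes as UTF-8 using the JDK's standard reader. */
    static Reader utf8Reader(byte[] bytes) {
        return new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
    }

    /** Wraps a reader so that a single leading U+FEFF, if any, is consumed. */
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader pushback = new PushbackReader(in, 1);
        int first = pushback.read();
        if (first != -1 && first != 0xFEFF) {
            pushback.unread(first); // not a BOM: put the character back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', '=', 'b'};
        // The plain reader hands the BOM back as an ordinary character.
        System.out.println(Integer.toHexString(utf8Reader(bytes).read())); // feff
        // The wrapped reader starts at the first real character.
        System.out.println((char) skipBom(utf8Reader(bytes)).read());      // a
    }
}
```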

Ensuring that the first line of the file is a comment or a blank line
fixes things:

# BOM
first.property=foo
second.property=bar

becomes:

\ufeff# BOM
first.property=foo
second.property=bar

> With the use of POEditor and the import/export scripts we have, it
> would be unusual for someone to be editing any of the property
> files where UTF-8 vs ISO-8859-1 matters. Thinking about it a little
> more, there would be a need to do this to edit non-English strings
> in the older branches where the key doesn't exist in the latest
> code. That strikes me as a fairly rare use case.
> 
> My other worry is that some editors will fail to handle the BOM 
> correctly and we'll end up causing more issues than we solve. I've 
> little basis for that worry other than (possibly out of date)
> experience.
> 
> Overall, I guess I am -0 on adding BOMs.

Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
don't get a huge number of outside contributions which include changes
to the localized properties files (except for the translation-only
contributions, which have been great!) and (b) often ignore the
non-English translations in the first place because we are lazy.

I think maybe this can stay on the back-burner until we see if we end
up with any problems.

Does/can "checkstyle" check for valid UTF-8 byte sequences in
.properties files? I think that may be a helpful check to add if it's
not already in there.
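
Independently of Checkstyle (no Checkstyle API is assumed here), a strict validity check can be sketched with the JDK's CharsetDecoder, configured to report malformed input instead of silently replacing it. Utf8Check and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true only if the whole array decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same text encoded two ways; only the first is valid UTF-8.
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

A check like this catches a file that was saved as ISO-8859-1 with accented characters, but a pure-ASCII file passes either way, since ASCII is valid in both encodings.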

-chris

Re: UTF-8 properties files and BOMs

Posted by Mark Thomas <ma...@apache.org>.

On 10/02/2020 20:58, Christopher Schultz wrote:
> All,
> 
> I've recently begun making a change to my application's resource
> bundles, converting them into UTF-8 for readability and converting
> them to ISO-8859-1 during my build process to make ResourceBundle happy.
> 
> I have everything working, except that Eclipse still thinks that my
> files ought to be ISO-8859-1 and ruins them when I load them.
> Sometimes, it's very obvious and that's not a problem: a developer
> will see that and fix it before continuing. But some files are only
> *slightly* broken by this and someone might make a mistake.

I don't think we have seen this with Tomcat. Or have we (since we
switched to UTF-8)?

The thing that bugged me was having to manually switch properties files
to UTF-8 to view them "properly". Your mail motivated me to track down
where I can change that in Eclipse:

Window->Preferences->General->Content Types

and I have changed Java properties files to use UTF-8. So that is my
personal niggle fixed. Thanks for the motivation.

> NOTE: We don't keep Eclipse settings in revision-control, so I can't
> modify everyone's Eclipse configuration. We are using svn and
> svn:mime-type is correctly set for these files; Eclipse just ignores
> that.

I've seen that too. While I found it rather annoying, it wasn't annoying
enough to try and find a fix as that looked like it would require
patching Eclipse and/or the svn plug-in.

> Anyway, I found that adding a UTF-8 BOM to the beginning of the file
> fixes that issue and Eclipse does the right thing.

Ah. So Eclipse *is* doing content scanning. Interesting.

> As a sanity check, I looked at how Tomcat's files are laid out and I
> don't see any BOMs.

Correct. The only files in the code base that should have BOMs at the
moment are the ones in the test web application (under bug49nnn) for
testing the default Servlet's handling of files with BOMs.

> Should we add BOMs? Is there any reason NOT to use a BOM? These are
> file types that are officially supposed to be ISO-8859-1 but everyone
> wants to handle them differently, so I think adding BOMs might be a
> good idea so that editors are always informed of exactly what's
> happening.
> 
> WDYT?

I was concerned that adding a BOM would cause problems when reading
property files. I've seen reports of that with Java in the past. A quick
test suggests that the issue is no longer present with latest Java 8.

With the use of POEditor and the import/export scripts we have, it would
be unusual for someone to be editing any of the property files where
UTF-8 vs ISO-8859-1 matters. Thinking about it a little more, there
would be a need to do this to edit non-English strings in the older
branches where the key doesn't exist in the latest code. That strikes me
as a fairly rare use case.

My other worry is that some editors will fail to handle the BOM
correctly and we'll end up causing more issues than we solve. I've
little basis for that worry other than (possibly out of date) experience.

Overall, I guess I am -0 on adding BOMs.

Mark

Re: UTF-8 properties files and BOMs

Posted by Martin Grigorov <mg...@apache.org>.

Hi Chris,

On Mon, Feb 10, 2020 at 10:59 PM Christopher Schultz <
chris@christopherschultz.net> wrote:

> All,
>
> I've recently begun making a change to my application's resource
> bundles, converting them into UTF-8 for readability and converting
> them to ISO-8859-1 during my build process to make ResourceBundle happy.
>

I guess you use Java 8.
Newer versions of Java try UTF-8 first and then fall back to ISO-8859-1:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html

API Note:
PropertyResourceBundle can be constructed either from an InputStream or a
Reader, which represents a property file. Constructing a
PropertyResourceBundle instance from an InputStream requires that the input
stream be encoded in UTF-8. By default, if a MalformedInputException or an
UnmappableCharacterException occurs on reading the input stream, then the
PropertyResourceBundle instance resets to the state before the exception,
re-reads the input stream in ISO-8859-1, and continues reading. If the
system property java.util.PropertyResourceBundle.encoding is set to either
"ISO-8859-1" or "UTF-8", the input stream is solely read in that encoding,
and throws the exception if it encounters an invalid sequence. If
"ISO-8859-1" is specified, characters that cannot be represented in
ISO-8859-1 encoding must be represented by Unicode Escapes as defined in
section 3.3 of The Java™ Language Specification whereas the other
constructor which takes a Reader does not have that limitation. Other
encoding values are ignored for this system property. The system property is
read and evaluated when initializing this class. Changing or removing the
property has no effect after the initialization.
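
The Reader constructor mentioned in the note has been available since Java 6, so it can also be used on Java 8, where the InputStream constructor still expects ISO-8859-1. A minimal sketch, assuming the caller already has the raw UTF-8 bytes of the file (Utf8BundleDemo and loadUtf8 are hypothetical names):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class Utf8BundleDemo {
    /** Builds a bundle from UTF-8 encoded property text via the Reader constructor. */
    static ResourceBundle loadUtf8(byte[] utf8Bytes) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        // Non-ASCII value written with escapes so this source file stays ASCII.
        byte[] file = "greeting=Gr\u00fc\u00df Gott\n".getBytes(StandardCharsets.UTF_8);
        ResourceBundle bundle = loadUtf8(file);
        System.out.println("Gr\u00fc\u00df Gott".equals(bundle.getString("greeting"))); // true
    }
}
```

Hooking this into ResourceBundle.getBundle would need a custom ResourceBundle.Control or similar, which is left out of the sketch.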

>
> I have everything working, except that Eclipse still thinks that my
> files ought to be ISO-8859-1 and ruins them when I load them.
> Sometimes, it's very obvious and that's not a problem: a developer
> will see that and fix it before continuing. But some files are only
> *slightly* broken by this and someone might make a mistake.
>
> NOTE: We don't keep Eclipse settings in revision-control, so I can't
> modify everyone's Eclipse configuration. We are using svn and
> svn:mime-type is correctly set for these files; Eclipse just ignores
> that.
>
> Anyway, I found that adding a UTF-8 BOM to the beginning of the file
> fixes that issue and Eclipse does the right thing.
>
> As a sanity check, I looked at how Tomcat's files are laid out and I
> don't see any BOMs.
>
> Should we add BOMs? Is there any reason NOT to use a BOM? These are
> file types that are officially supposed to be ISO-8859-1 but everyone
> wants to handle them differently, so I think adding BOMs might be a
> good idea so that editors are always informed of exactly what's
> happening.
>
> WDYT?
>

I don't use Eclipse so I cannot help you with that.
My gut feeling says: If it ain't broken then don't fix it. If no one
complained so far then probably such kind of improvement is not worth it.

>
> -chris