Posted to issues@maven.apache.org by "Aaron Digulla (Jira)" <ji...@apache.org> on 2020/07/15 12:41:00 UTC
[jira] [Comment Edited] (MRESOURCES-171) ISO8859-1 properties files get changed into UTF-8 when filtered

    [ https://issues.apache.org/jira/browse/MRESOURCES-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158128#comment-17158128 ] 

Aaron Digulla edited comment on MRESOURCES-171 at 7/15/20, 12:40 PM:
---------------------------------------------------------------------

Short discussion regarding the default value:

project.build.sourceEncoding:

Pro: It's not a breaking change.

Con: 99% of all Java developers are not aware that the problem even exists. Many are US developers who don't care about characters outside the ASCII charset, so they're not affected. This would mean that most builds will stay broken without anyone noticing. Only when translations into other languages are added will weird things happen, and people will be confused.

Frankly, even many developers in Europe don't understand the problem (just look at the comments here, where people argue that UTF-8 is good/better/valid when it's clearly not).

ISO-8859-1:

Pro: That's what it should have been all along.

ISO-8859-1 can process UTF-8 unchanged since the encoding is binary stable (every byte of input maps to the same byte of output). So while a human would see those UTF-8 sequences for umlauts and special characters, the computer doesn't care. This can only fail when people use resource filtering and try to replace a variable with a System property with special characters. Pure ASCII replacements still work. That's the only corner case where we get the dreaded UTF-8 sequence unrolling (where you start to see those Ã characters).
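
The byte-stability argument above can be checked with a minimal Java sketch. The class and method names (Latin1RoundTrip, roundTrip, filter) are hypothetical and only illustrate the idea; this is not the plugin's filtering code, and the placeholder replacement is a plain string substitution standing in for it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of the Maven Resources Plugin.
final class Latin1RoundTrip {

    /** Decodes the bytes as ISO-8859-1 and encodes them again with the same charset. */
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Stand-in for filtering: literal placeholder replacement while treating the file as ISO-8859-1. */
    static byte[] filter(byte[] input, String placeholder, String value) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        return decoded.replace(placeholder, value).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A UTF-8 encoded line containing u-umlaut and sharp s (written as escapes to keep this source ASCII).
        byte[] utf8 = "greeting=Gr\u00fc\u00dfe ${name}".getBytes(StandardCharsets.UTF_8);

        // Every byte maps to exactly one char and back, so the UTF-8 bytes survive untouched.
        System.out.println(Arrays.equals(utf8, roundTrip(utf8)));

        // A pure ASCII replacement keeps the file valid UTF-8.
        System.out.println(new String(filter(utf8, "${name}", "World"), StandardCharsets.UTF_8));

        // A non-ASCII replacement is written as a single ISO-8859-1 byte,
        // so the result mixes two encodings and is no longer valid UTF-8.
        System.out.println(new String(filter(utf8, "${name}", "J\u00f6rg"), StandardCharsets.UTF_8));
    }
}
```

The last call is the corner case described above: the surrounding file is still UTF-8, but the inserted value is not.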

Con: There is a chance that builds will break if people added the wrong workaround to fix the issue. One fix would be the complex config above. As far as I can tell, the fix above is compatible with ISO-8859-1 as default. It can get messy when people have changed the loading code to use UTF-8.
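
To illustrate why changed loading code can get messy, here is a small sketch comparing the two loading paths. It relies on the documented behaviour that java.util.Properties.load(InputStream) assumes ISO 8859-1, while load(Reader) uses whatever charset the reader was created with. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical illustration of the two ways application code may load a properties file.
final class PropertiesLoading {

    /** Standard loading: the stream is interpreted as ISO-8859-1. */
    static Properties loadDefault(byte[] content) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Workaround-style loading: the stream is explicitly decoded as UTF-8. */
    static Properties loadAsUtf8(byte[] content) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String line = "name=J\u00f6rg"; // o-umlaut written as an escape
        byte[] latin1 = line.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);

        // Matching pairs give the intended value.
        System.out.println(loadDefault(latin1).getProperty("name"));
        System.out.println(loadAsUtf8(utf8).getProperty("name"));

        // Mismatched pairs give garbled values without any exception.
        System.out.println(loadDefault(utf8).getProperty("name"));
        System.out.println(loadAsUtf8(latin1).getProperty("name"));
    }
}
```

Neither mismatch throws, which is why a change of the file encoding under UTF-8 loading code (or the reverse) tends to go unnoticed until someone looks at the rendered text.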

That being said, if you chose to keep UTF-8 as the default, projects would silently fail for a long time without anyone noticing. I think this is bad. When something is broken, it should blow up in a way that people can see and do something about it.

So as I see it, using the correct default (as Java defines it) will break a small number of builds, but the fix is easy: Remove all workarounds. If people really don't like it, they can stay with the old version of the plugin. That's just a two-minute change in the POM.

What I would like is a warning or error when you're affected. Maybe we should check for characters with codePoint >= 128, check whether resource filtering is enabled, and print a warning?
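
A rough sketch of what such a check could look like, assuming it runs on the raw bytes of a resource before filtering. Everything here (the NonAsciiCheck class, its methods, the wording of the warning) is hypothetical and not taken from the plugin. Scanning bytes instead of decoded code points is a simplification: a byte >= 0x80 means a non-ASCII character in both ISO-8859-1 and UTF-8, so the check does not need to know the file's encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper; names and message text are assumptions for illustration only.
final class NonAsciiCheck {

    /** Returns the offset of the first byte >= 0x80, or -1 if the content is pure ASCII. */
    static int firstNonAsciiOffset(byte[] content) {
        for (int i = 0; i < content.length; i++) {
            if ((content[i] & 0xFF) >= 128) {
                return i;
            }
        }
        return -1;
    }

    /** Builds a warning only when filtering is enabled and the file contains non-ASCII bytes. */
    static Optional<String> warningFor(String fileName, byte[] content, boolean filteringEnabled) {
        if (!filteringEnabled) {
            return Optional.empty();
        }
        int offset = firstNonAsciiOffset(content);
        if (offset < 0) {
            return Optional.empty();
        }
        return Optional.of(fileName + ": non-ASCII byte at offset " + offset
                + "; filtering may change the encoding of this file");
    }

    public static void main(String[] args) {
        byte[] ascii = "key=value".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "key=J\u00f6rg".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(warningFor("test.properties", ascii, true).isPresent());
        warningFor("test.properties", latin1, true).ifPresent(System.out::println);
    }
}
```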


> ISO8859-1 properties files get changed into UTF-8 when filtered
> ---------------------------------------------------------------
>
>                 Key: MRESOURCES-171
>                 URL: https://issues.apache.org/jira/browse/MRESOURCES-171
>             Project: Maven Resources Plugin
>          Issue Type: Bug
>          Components: filtering
>            Reporter: Alex Collins
>            Priority: Minor
>         Attachments: filtering-bug.zip
>
>
> Create:
> src/main/resources/test.properties
> And add an ISO8859-1 character that is not ASCII or UTF-8, do not use \uXXXX formatting.
> When adding this line:
> <resource><directory>src/main/resources</directory><filtering>true</filtering></resource>
> Expected:
> ISO8859-1 encoded file in jar.
> Actual:
> UTF-8 encoded file in jar.
> ---
> If there are any property files (which can only be ISO8859-1) they appear to be converted into UTF-8 in the jar.


