Posted to dev@creadur.apache.org by "P. Ottlinger" <po...@apache.org> on 2015/02/17 23:59:31 UTC

RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Hi *,

after finalizing the analysis on
https://issues.apache.org/jira/browse/RAT-190
it seems that RAT is not explicit enough when it comes to encoding.

CAUSE/BUG BACKGROUND
If mvn is configured to run with a non UTF-8 encoding there will be
problems when matching UTF-8 content with licenses.

PATCH PROPOSAL
I've browsed over some of the code parts and added some "UTF-8" to make
it more explicit that UTF-8 should be the default. What do you think of
that proposal?
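
To illustrate what "making the charset explicit" could look like, here is a minimal sketch. It is not the actual patch: the class ExplicitCharsetReader and its method are made up for illustration, and only Files.newBufferedReader(Path, Charset) is standard JDK API. The point is that the charset is passed by the caller instead of being taken from the JVM default (file.encoding).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not RAT code: reads the first line of a file with a
// caller-supplied charset instead of relying on the JVM default.
final class ExplicitCharsetReader {

    static String firstLine(Path file, Charset charset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line = reader.readLine();
            return line == null ? "" : line;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("rat-encoding", ".txt");
        try {
            // The text contains a non-ASCII character (u-umlaut, U+00FC),
            // written as a Unicode escape so this source file stays ASCII.
            Files.write(file, "Lizenziert f\u00fcr Apache".getBytes(StandardCharsets.UTF_8));
            // Same bytes on disk, two different decodings.
            System.out.println(firstLine(file, StandardCharsets.UTF_8));
            System.out.println(firstLine(file, StandardCharsets.ISO_8859_1));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With the second call the two UTF-8 bytes of the umlaut are decoded as two separate characters, so a string comparison against the expected license text would no longer match.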

YOUR FEEDBACK WANTED
1) Is the patch sufficient?
2a) Should we have a RAT configuration option that allows setting the
encoding explicitly, with UTF-8 as the default if nothing is configured?
2b) Or should we just hardcode UTF-8 as the default and not give the
user a chance to set the encoding?

IMPROVE TESTABILITY?
Since we seem to run with UTF-8 encoding in Jenkins we did not see these
problems before. Does anyone have a good idea on how to test this?
Should a UTF-8 encoded file be analysed while mvn runs with a
file.encoding other than UTF-8?
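
One possible way to test this without depending on how the JVM was started is to pass the reader charset explicitly and loop over a few candidates. The sketch below is only an assumption of how such a check might look: the class name and the header text are invented, and it uses plain string matching rather than RAT's real matchers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Sketch of an encoding regression check; the license text and class name are
// made up for illustration and are not part of RAT.
final class EncodingRegressionSketch {

    // A header containing a non-ASCII character (the copyright sign, U+00A9).
    static final String HEADER = "Copyright \u00a9 Example Foundation";

    // Encodes HEADER as UTF-8 (the "file on disk"), decodes it with the given
    // charset (the "environment") and reports whether the text is still found.
    static boolean headerFound(Charset readerCharset) {
        byte[] onDisk = HEADER.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(onDisk, readerCharset);
        return decoded.contains(HEADER);
    }

    public static void main(String[] args) {
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1,
                StandardCharsets.US_ASCII);
        for (Charset cs : candidates) {
            System.out.println(cs.name() + " -> header found: " + headerFound(cs));
        }
    }
}
```

Only the UTF-8 reader finds the header here. A real test would feed the decoded text to the license matchers, but the structure (fixed bytes, varying reader charset) would stay the same and would not depend on the Jenkins default.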

Cheers & thanks for any opinions :-)

Phil

Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by "P. Ottlinger" <po...@apache.org>.
Hi *,

Am 24.02.2015 um 00:53 schrieb sebb:
>>> The only change that I'd be in favour of would be to enforce an
>>> explicit encoding. Or, in other words, throw an exception if an
>>> encoding (aka charset) isn't explicitly chosen.
>>
>> What do you think of adding two more configuration options in the
>> mvn-plugin:
>>
>> defaultLocale - defaults to Locale.US
> Why US?
> What is the Locale used for?
> And why should it differ from the user's Locale?

I chose Locale.US since it is hard-coded into RAT at the moment.

UTF-8 seems fine to me for most modern operating systems.

Please verify the current behaviour of RAT with the test project
attached at RAT-190.

If you do not want any defaults or changes, how can we make RAT more
encoding-aware, so that it gives users a hint that their encoding
configuration may be faulty?

Cheers,
Phil




Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by sebb <se...@gmail.com>.
On 23 February 2015 at 20:31, P. Ottlinger <po...@apache.org> wrote:
> Hi *,
>
> thanks for all your input.
>
> Am 19.02.2015 um 11:30 schrieb Jochen Wiedmann:
>> My personal belief is that a default doesn't make sense at all.

I tend to agree here.
But the behaviour on various systems needs to be checked.

>> Whatever you choose, you'll find people that cannot use it. For
>> example, in the case of UTF-8, I am quite certain that it will be
>> wrong for Western Europeans, like you and me.
>
> I don't really see your point there - most *nix operating systems have
> UTF-8 as default encoding. The point of UTF-8 is to provide relatively
> broad compatibility, in contrast to US-ASCII/CP1292 or other reduced
> charsets. Since UTF-16 exists, UTF-8 is a compromise that should work for
> the majority of users - IMHO.

Probably does not work for MacOSX or Windows users.

This needs to be checked.

>
>> The only change that I'd be in favour of would be to enforce an
>> explicit encoding. Or, in other words, throw an exception if an
>> encoding (aka charset) isn't explicitly chosen.
>
> What do you think of adding two more configuration options in the
> mvn-plugin:
>
> defaultLocale - defaults to Locale.US

Why US?
What is the Locale used for?
And why should it differ from the user's Locale?

> defaultEncoding - defaults to UTF-8

Only if it can be shown to be useful on non-US non-Unix systems.

> With that, a user who wants to run RAT on a reduced charset or on mixed
> content could configure it.
>
> I'd like to replace all UTF-8 in the code with the value of that
> default. Same applies for Locale?
>
> This would at least make it transparent what is going on.
>
> WHAT HAPPENED IN RAT-190?
> Just as a quick reminder: a user ran RAT in a CP1292 encoded environment
> and did not find license matches in a UTF-8 encoded file.
> If mvn ran with UTF-8 via -Dfile.encoding=UTF-8 everything was fine and
> RAT was able to match.
>
> The RAT user's request, which is correct to my mind, is to either provide
> meaningful defaults or make it possible to configure encoding-specific stuff.
>
> What do you think about adding those 2 options with above defaults?
>
> Cheers,
> Phil
>

Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by sebb <se...@gmail.com>.
On 26 February 2015 at 07:15, Jochen Wiedmann <jo...@gmail.com> wrote:
>> I guess if we ever start RAT2 with the ability to scan in parallel we
>> should properly think about mixed encodings ;-D
>
> Amen to that. In fact, I have wasted some thoughts about that in the
> past. Things I'd like to see being implemented:
>
> - A plugin system
>   * Licenses aren't part of the core, but are developed and
>     distributed as plugins.
>   * In particular: Licenses can be added as plugins.
> - Proper parsing of files. For instance: Rather

?

Looks like there is something missing here.

> - Radically shortened code by making use of modern IoC principles
>   (commons-inject?)
> - Proper handling of text files, including specification of encoding
>   on a directory level, or even a file level.
> - Enhance the core, so that Ant (Maven) specific parts can be
>   minimized. We have too many issues that are specific to CLI (Ant)
>   (Maven), rather than common to all.

One such issue is the handling of filename patterns.
I think there are 3 different syntaxes here: CLI, Ant, Maven.

Not much can be done about synchronising the different Ant/Maven
syntaxes, but the CLI should use either Ant or Maven style (or perhaps
the user can choose which) rather than its own incompatible syntax.

> Jochen
>
> --
> Any world that can produce the Taj Mahal, William Shakespeare,
> and Stripe toothpaste can't be all bad. (C.R. MacNamara, One Two Three)

Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by Jochen Wiedmann <jo...@gmail.com>.
> I guess if we ever start RAT2 with the ability to scan in parallel we
> should properly think about mixed encodings ;-D

Amen to that. In fact, I have wasted some thoughts about that in the
past. Things I'd like to see being implemented:

- A plugin system
  * Licenses aren't part of the core, but are developed and
    distributed as plugins.
  * In particular: Licenses can be added as plugins.
- Proper parsing of files. For instance: Rather
- Radically shortened code by making use of modern IoC principles
  (commons-inject?)
- Proper handling of text files, including specification of encoding
  on a directory level, or even a file level.
- Enhance the core, so that Ant (Maven) specific parts can be
  minimized. We have too many issues that are specific to CLI (Ant)
  (Maven), rather than common to all.
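
As a rough idea of what "encoding on a directory level, or even a file level" might mean in code, here is a small sketch. Nothing like this exists in RAT as far as this thread shows; the class PathEncodings and its methods are purely hypothetical. The most specific configured path wins, and a fallback applies otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table: the most specific configured path wins.
final class PathEncodings {

    private final Map<Path, Charset> overrides = new LinkedHashMap<>();
    private final Charset fallback;

    PathEncodings(Charset fallback) {
        this.fallback = fallback;
    }

    // Registers a charset for a file or directory path.
    PathEncodings with(String path, Charset charset) {
        overrides.put(Paths.get(path).normalize(), charset);
        return this;
    }

    // Walks from the file itself up through its parent directories and returns
    // the first configured charset, or the fallback if none applies.
    Charset charsetFor(String file) {
        for (Path p = Paths.get(file).normalize(); p != null; p = p.getParent()) {
            Charset cs = overrides.get(p);
            if (cs != null) {
                return cs;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        PathEncodings encodings = new PathEncodings(StandardCharsets.UTF_8)
                .with("src/legacy", StandardCharsets.ISO_8859_1)
                .with("src/legacy/readme.txt", StandardCharsets.US_ASCII);
        System.out.println(encodings.charsetFor("src/main/App.java"));
        System.out.println(encodings.charsetFor("src/legacy/Old.java"));
        System.out.println(encodings.charsetFor("src/legacy/readme.txt"));
    }
}
```

Walking up the parent chain keeps the lookup simple; a real implementation would probably have to deal with patterns and absolute paths as well.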

Jochen

-- 
Any world that can produce the Taj Mahal, William Shakespeare,
and Stripe toothpaste can't be all bad. (C.R. MacNamara, One Two Three)

Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by "P. Ottlinger" <po...@apache.org>.
Hi Jochen,

thanks for your input -

Am 24.02.2015 um 15:48 schrieb Jochen Wiedmann:
> What we should have, of course, is a Maven parameter
> 
>   encoding=${rat.encoding}
> 
> which might be used to override the Maven default.

I played around with the test project (attached to RAT-190) a bit.

If I run mvn on a Mac with -Dfile.encoding=CP-1292 it does not change
the mvn default encoding UTF-8 .... thus I doubt that an encoding
parameter would make sense.

Since we cannot manage to find a common denominator, I'd suggest not
changing anything, but adding a note to the webpage. Can someone add
something like:

"Make sure to define a Maven default encoding (e.g. via
-Dfile.encoding=UTF-8 or in your pom.xml) if you define acceptable
licenses yourself and have a mix of encodings in your source files.
RAT uses the Maven default encoding while scanning your files and cannot
detect a mix of encodings, which may result in false alarms."

Would that note be a sufficient compromise?

I guess if we ever start RAT2 with the ability to scan in parallel we
should properly think about mixed encodings ;-D

Phil


Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by Jochen Wiedmann <jo...@gmail.com>.
On Mon, Feb 23, 2015 at 9:31 PM, P. Ottlinger <po...@apache.org> wrote:

> What do you think of adding two more configuration options in the
> mvn-plugin:
>
> defaultLocale - defaults to Locale.US
> defaultEncoding - defaults to UTF-8

"Default" = What to choose, if the user doesn't specify anything. So,
I don't see any sense in specifying a default, apart from what Maven
gives us anyways:

  project.getBuild().getSourceEncoding()

or

  ${project.build.sourceEncoding}

Besides, a charset or a locale is sufficient. What would we need a
Locale for? We're not going to format numbers or currency values,
are we?

What we should have, of course, is a Maven parameter

  encoding=${rat.encoding}

which might be used to override the Maven default.
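
The lookup order described here could be sketched roughly as follows. This is an assumption about how it might be wired up, not existing plugin code: the class EncodingResolver is invented, and its two arguments stand for the values of ${rat.encoding} and ${project.build.sourceEncoding} as Maven would inject them. When neither is set it throws instead of guessing, in line with the "enforce an explicit encoding" idea raised elsewhere in this thread.

```java
import java.nio.charset.Charset;

// Hypothetical resolver; the parameter names mirror the properties mentioned
// in this thread, but the class itself is not part of RAT or Maven.
final class EncodingResolver {

    // ratEncoding: value of a plugin parameter such as ${rat.encoding}, or null.
    // sourceEncoding: value of ${project.build.sourceEncoding}, or null.
    static Charset resolve(String ratEncoding, String sourceEncoding) {
        if (ratEncoding != null && !ratEncoding.trim().isEmpty()) {
            // The plugin-specific setting overrides the Maven value.
            return Charset.forName(ratEncoding.trim());
        }
        if (sourceEncoding != null && !sourceEncoding.trim().isEmpty()) {
            return Charset.forName(sourceEncoding.trim());
        }
        // No silent fallback to the platform default: fail loudly instead.
        throw new IllegalStateException(
                "No encoding configured; set rat.encoding or project.build.sourceEncoding");
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1", "UTF-8"));
        System.out.println(resolve(null, "UTF-8"));
    }
}
```

Charset.forName also rejects unknown or misspelled charset names with an exception, which would surface a faulty configuration early.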

Jochen



-- 
Any world that can produce the Taj Mahal, William Shakespeare,
and Stripe toothpaste can't be all bad. (C.R. MacNamara, One Two Three)

Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by "P. Ottlinger" <po...@apache.org>.
Hi *,

thanks for all your input.

Am 19.02.2015 um 11:30 schrieb Jochen Wiedmann:
> My personal belief is that a default doesn't make sense at all.
> Whatever you choose, you'll find people that cannot use it. For
> example, in the case of UTF-8, I am quite certain that it will be
> wrong for Western Europeans, like you and me.

I don't really see your point there - most *nix operating systems have
UTF-8 as default encoding. The point of UTF-8 is to provide relatively
broad compatibility, in contrast to US-ASCII/CP1292 or other reduced
charsets. Since UTF-16 exists, UTF-8 is a compromise that should work for
the majority of users - IMHO.


> The only change that I'd be in favour of would be to enforce an
> explicit encoding. Or, in other words, throw an exception if an
> encoding (aka charset) isn't explicitly chosen.

What do you think of adding two more configuration options in the
mvn-plugin:

defaultLocale - defaults to Locale.US
defaultEncoding - defaults to UTF-8

With that, a user who wants to run RAT on a reduced charset or on mixed
content could configure it.

I'd like to replace all UTF-8 in the code with the value of that
default. Same applies for Locale?

This would at least make it transparent what is going on.

WHAT HAPPENED IN RAT-190?
Just as a quick reminder: a user ran RAT in a CP1292 encoded environment
and did not find license matches in a UTF-8 encoded file.
If mvn ran with UTF-8 via -Dfile.encoding=UTF-8 everything was fine and
RAT was able to match.

The RAT user's request, which is correct to my mind, is to either provide
meaningful defaults or make it possible to configure encoding-specific stuff.

What do you think about adding those 2 options with above defaults?

Cheers,
Phil


Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by Jochen Wiedmann <jo...@gmail.com>.
My personal belief is that a default doesn't make sense at all.
Whatever you choose, you'll find people that cannot use it. For
example, in the case of UTF-8, I am quite certain that it will be
wrong for Western Europeans, like you and me.

The only change that I'd be in favour of would be to enforce an
explicit encoding. Or, in other words, throw an exception if an
encoding (aka charset) isn't explicitly chosen.

Jochen


On Tue, Feb 17, 2015 at 11:59 PM, P. Ottlinger <po...@apache.org> wrote:
> Hi *,
>
> after finalizing the analysis on
> https://issues.apache.org/jira/browse/RAT-190
> it seems that RAT is not explicit enough when it comes to encoding.
>
> CAUSE/BUG BACKGROUND
> If mvn is configured to run with a non UTF-8 encoding there will be
> problems when matching UTF-8 content with licenses.
>
> PATCH PROPOSAL
> I've browsed over some of the code parts and added some "UTF-8" to make
> it more explicit that UTF-8 should be the default. What do you think of
> that proposal?
>
> YOUR FEEDBACK WANTED
> 1) Is the patch sufficient?
> 2a) Should we have a RAT configuration option that allows setting the
> encoding explicitly, with UTF-8 as the default if nothing is configured?
> 2b) Or should we just hardcode UTF-8 as the default and not give the
> user a chance to set the encoding?
>
> IMPROVE TESTABILITY?
> Since we seem to run with UTF-8 encoding in Jenkins we did not see these
> problems before. Does anyone have a good idea on how to test this?
> Should a UTF-8 encoded file be analysed while mvn runs with a
> file.encoding other than UTF-8?
>
> Cheers & thanks for any opinions :-)
>
> Phil



-- 
Any world that can produce the Taj Mahal, William Shakespeare,
and Stripe toothpaste can't be all bad. (C.R. MacNamara, One Two Three)

Re: RAT-190 - default encoding UTF-8 / patch / what should be implemented?

Posted by sebb <se...@gmail.com>.
On 17 February 2015 at 22:59, P. Ottlinger <po...@apache.org> wrote:
> Hi *,
>
> after finalizing the analysis on
> https://issues.apache.org/jira/browse/RAT-190
> it seems that RAT is not explicit enough when it comes to encoding.
>
> CAUSE/BUG BACKGROUND
> If mvn is configured to run with a non UTF-8 encoding there will be
> problems when matching UTF-8 content with licenses.
>
> PATCH PROPOSAL
> I've browsed over some of the code parts and added some "UTF-8" to make
> it more explicit that UTF-8 should be the default. What do you think of
> that proposal?
>
> YOUR FEEDBACK WANTED
> 1) Is the patch sufficient?
> 2a) Should we have a RAT configuration option that allows setting the
> encoding explicitly, with UTF-8 as the default if nothing is configured?
> 2b) Or should we just hardcode UTF-8 as the default and not give the
> user a chance to set the encoding?
>
> IMPROVE TESTABILITY?
> Since we seem to run with UTF-8 encoding in Jenkins we did not see these
> problems before. Does anyone have a good idea on how to test this?
> Should a UTF-8 encoded file be analysed while mvn runs with a
> file.encoding other than UTF-8?
>
> Cheers & thanks for any opinions :-)

Seems to me that there are several potentially different encodings
involved here.

- The encoding used for the license file templates.
  Ones that are defined in built-in strings should not be an issue, but
  the templates can be externally provided, either as files or as part
  of the pom.
- The encoding used for the files being checked.

I think we can assume that all the source files will have the same
encoding, but that may differ from the templates.
I assume that Maven takes care of the encoding when interpreting the pom.

If external templates are used, then can we insist that these use the
same encoding as the source files?

Does RAT include any template files? If so, presumably they have a
fixed (known) encoding.
Does RAT know when reading a template file whether it is a 3rd party
file or not?
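
To make the distinction concrete, here is a small sketch of why the two encodings matter separately. It is illustrative only and not taken from RAT: the class TemplateMatchSketch, the sample license text and the choice of ISO-8859-1 for the template are all assumptions. If each input is decoded with its own charset, the comparison happens on characters and the byte-level difference disappears.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: compares a template and a source header after decoding
// each with its own charset. Not taken from RAT.
final class TemplateMatchSketch {

    static boolean matches(byte[] template, Charset templateCharset,
                           byte[] source, Charset sourceCharset) {
        String pattern = new String(template, templateCharset).trim();
        String text = new String(source, sourceCharset);
        return text.contains(pattern);
    }

    public static void main(String[] args) {
        // Sample text with a non-ASCII character (copyright sign, U+00A9).
        String license = "Licensed to the Example Foundation \u00a9";
        byte[] template = license.getBytes(StandardCharsets.ISO_8859_1);
        byte[] source = ("// " + license + "\n").getBytes(StandardCharsets.UTF_8);

        // Each input decoded with its real charset: the texts match.
        System.out.println(matches(template, StandardCharsets.ISO_8859_1,
                                   source, StandardCharsets.UTF_8));
        // Both decoded with one assumed charset: the template's non-ASCII
        // byte is not valid UTF-8, so the match fails.
        System.out.println(matches(template, StandardCharsets.UTF_8,
                                   source, StandardCharsets.UTF_8));
    }
}
```

So insisting on one shared encoding is not strictly needed, as long as RAT knows the charset of each side before comparing.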

> Phil