Posted to dev@maven.apache.org by Michael Osipov <mi...@apache.org> on 2014/11/13 23:15:00 UTC

What to do if project.build.sourceEncoding is not provided?

Hi folks,

I'd like to know if we have a general consensus on this:

I am investigating MPIR-242 and figured out the cause. The input stream 
is obtained from the HTTP URL and no encoding is given, so ISO-8859-1 is 
used as the default (yuck!). While I know that some reporting-related 
modules have default values for input/output encoding, this contradicts 
our general approach of using the platform encoding when 
project.build.sourceEncoding is not given.
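
To make the symptom concrete, here is a minimal Java sketch of what happens when UTF-8 bytes are decoded with a hard-coded ISO-8859-1 default. It is not taken from the plugin; the class name, method names and sample text are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: UTF-8 bytes decoded with the wrong default charset. */
final class WrongDefaultCharsetDemo {

    /** Decodes the bytes as ISO-8859-1, mimicking a hard-coded default. */
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the same bytes as UTF-8, the encoding assumed for this example. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical license title containing non-ASCII characters,
        // written with Unicode escapes so the source file encoding does not matter.
        byte[] raw = "Licence G\u00e9n\u00e9rale".getBytes(StandardCharsets.UTF_8);

        // Each two-byte UTF-8 sequence becomes two separate Latin-1 characters.
        System.out.println(decodeAsLatin1(raw));
        // Decoding with the matching charset restores the original text.
        System.out.println(decodeAsUtf8(raw));
    }
}
```

ISO-8859-1 maps every byte to a character, so the wrong decoding never fails; it silently produces garbled text, which is why the problem only shows up in the generated report.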

In this particular case, switching to the platform encoding would at 
least make the behavior consistent. Setting project.build.sourceEncoding 
to UTF-8 would solve the problem but is just a workaround. HTML resources 
carry their encoding with them, but ProjectInfoReportUtils treats 
everything as a plain input stream (not helpful with XML/HTML). I would 
really like to avoid peeking with a pushback input stream.

What is your opinion on this?

I have two solutions in mind for the issue above:

1. Easy: remove ISO-8859-1 and assume the platform encoding if 
project.build.sourceEncoding is not provided.
2. Complex: use an HTML parser (JSoup is awesome and license-compatible 
[1]) to get correctly encoded content.
But how do we know that the URL really points to an HTML file and not 
a license.txt? Inspect the content type?
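
For reference, the "easy" option could look roughly like the following Java sketch. It is an assumption about the shape of the change, not the actual ProjectInfoReportUtils code; the class and method names are invented.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of option 1: source encoding if given, platform encoding otherwise. */
final class EasyOptionSketch {

    /**
     * @param sourceEncoding value of project.build.sourceEncoding, may be null or blank
     * @return the configured charset, or the platform default when nothing is configured
     */
    static Charset resolve(String sourceEncoding) {
        if (sourceEncoding == null || sourceEncoding.trim().isEmpty()) {
            // No hard-coded ISO-8859-1 any more: fall back to the platform encoding.
            return Charset.defaultCharset();
        }
        // Throws an IllegalArgumentException subclass for unknown or malformed names.
        return Charset.forName(sourceEncoding.trim());
    }
}
```

Note that the result then depends on the machine running the build, which is the usual drawback of any platform-encoding fallback.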

[1] http://apache.org/legal/resolved.html#category-a

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@maven.apache.org
For additional commands, e-mail: dev-help@maven.apache.org


Re: What to do if project.build.sourceEncoding is not provided?

Posted by Michael Osipov <mi...@apache.org>.
Very simple ;-)

Let's do it so.

On 2014-11-14 at 19:50, Hervé BOUTEMY wrote:
> more stupid simple:
> 1. If the parameter is set, use that regardless of the rest
> 2. If not set, assume UTF-8



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Hervé BOUTEMY <he...@free.fr>.
more stupid simple:
1. If the parameter is set, use that regardless of the rest
2. If not set, assume UTF-8
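
The two-step rule above is small enough to sketch directly. The Java below is only an illustration, not the plugin's actual code: the class and method names are invented, and licenseEncoding stands for the goal parameter proposed earlier in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: the parameter wins, UTF-8 otherwise. */
final class SimpleLicenseEncoding {

    /**
     * @param licenseEncoding value of the (assumed) goal parameter, may be null or blank
     */
    static Charset resolve(String licenseEncoding) {
        // Rule 2: nothing configured, so assume UTF-8.
        if (licenseEncoding == null || licenseEncoding.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        // Rule 1: an explicit parameter is used regardless of anything else.
        // Charset.forName throws an IllegalArgumentException subclass for bad names.
        return Charset.forName(licenseEncoding.trim());
    }

    /** Decodes downloaded license bytes with the resolved charset. */
    static String decode(byte[] raw, String licenseEncoding) {
        return new String(raw, resolve(licenseEncoding));
    }
}
```

The appeal of this variant is that the outcome never depends on the server, the file content or the build machine, so it is easy to document.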

Regards,

Hervé

On Friday, 14 November 2014 19:30:54, Michael Osipov wrote:
> Just to be clear, you are favoring:
> 
> Alternative:
> 1. If the parameter is set, use that regardless of the rest
> 2. If not set, obtain the content type
> 3. Check whether it contains a charset qualifier; if yes, use that
> 4. If not, check whether this is an HTML file and pass it to JSoup (do magic)
> 5. If nothing else works, assume UTF-8
> 
> Michael



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Michael Osipov <mi...@apache.org>.
Just to be clear, you are favoring:

Alternative:
1. If the parameter is set, use that regardless of the rest
2. If not set, obtain the content type
3. Check whether it contains a charset qualifier; if yes, use that
4. If not, check whether this is an HTML file and pass it to JSoup (do magic)
5. If nothing else works, assume UTF-8

Michael

On 2014-11-14 at 19:09, Hervé BOUTEMY wrote:
> I prefer the alternative
> and if no parameter is set, just keep it stupid simple: assume UTF-8
>
> IMHO, this will give good results and will be easy to explain
>
> anything more complex is harder to maintain and to explain in case magic does
> not do what was dreamt of
>
> Regards,
>
> Hervé



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Hervé BOUTEMY <he...@free.fr>.
I prefer the alternative
and if no parameter is set, just keep it stupid simple: assume UTF-8

IMHO, this will give good results and will be easy to explain

anything more complex is harder to maintain and to explain in case magic does 
not do what was dreamt of

Regards,

Hervé

On Friday, 14 November 2014 18:43:02, Michael Osipov wrote:
> On 2014-11-14 at 18:07, Hervé BOUTEMY wrote:
> > [..]
> > 
> >> The parameter won't help if there are several licenses with several
> >> encodings used.
> > 
> > looks like the parameter can be either simple or complex: need a syntax
> > 
> > or just ignore: is it theory or reality?
> 
> Pure theory.
> 
> My approach would be this:
> 
> Provide a license parameter: licenseEncoding
> 
> 1. Obtain the content type
> 2. Check whether it contains a charset qualifier; if yes, use that
> 3. If not, check whether this is an HTML file and pass it to JSoup (do magic)
> 4. If nothing else can be determined, use the parameter
> 5. If the parameter is not set, assume UTF-8
> 
> Alternative:
> 
> 1. If the parameter is set, use that regardless of the rest
> 2. If not, continue with the first approach and omit step 4
> 
> WDYT?
> 
> Michael


Re: What to do if project.build.sourceEncoding is not provided?

Posted by Michael Osipov <mi...@apache.org>.
On 2014-11-14 at 18:07, Hervé BOUTEMY wrote:
> [..]
>> The parameter won't help if there are several licenses with several
>> encodings used.
> looks like the parameter can be either simple or complex: need a syntax
>
> or just ignore: is it theory or reality?

Pure theory.

My approach would be this:

Provide a license parameter: licenseEncoding

1. Obtain the content type
2. Check whether it contains a charset qualifier; if yes, use that
3. If not, check whether this is an HTML file and pass it to JSoup (do magic)
4. If nothing else can be determined, use the parameter
5. If the parameter is not set, assume UTF-8

Alternative:

1. If the parameter is set, use that regardless of the rest
2. If not, continue with the first approach and omit step 4
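
The difference between the two orderings is easier to see in code. The following Java sketch is an assumption about how the steps could be chained, not an existing implementation: the class and method names are invented, and the JSoup step is represented by a plain Supplier placeholder instead of a real JSoup call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch of both proposed resolution orders. */
final class LicenseEncodingResolver {

    /**
     * First approach: Content-Type charset, then HTML detection,
     * then the licenseEncoding parameter, then UTF-8.
     *
     * @param fromHeader charset taken from the Content-Type header, if any
     * @param fromHtml   placeholder for the HTML-based detection (the JSoup step);
     *                   only evaluated when the header gave no charset
     * @param parameter  the configured licenseEncoding, if any
     */
    static Charset detectFirst(Optional<Charset> fromHeader,
                               Supplier<Optional<Charset>> fromHtml,
                               Optional<Charset> parameter) {
        if (fromHeader.isPresent()) {
            return fromHeader.get();              // steps 1 and 2
        }
        Optional<Charset> detected = fromHtml.get();
        if (detected.isPresent()) {
            return detected.get();                // step 3
        }
        return parameter.orElse(StandardCharsets.UTF_8); // steps 4 and 5
    }

    /** Alternative: the parameter wins; otherwise the first approach without step 4. */
    static Charset parameterFirst(Optional<Charset> fromHeader,
                                  Supplier<Optional<Charset>> fromHtml,
                                  Optional<Charset> parameter) {
        if (parameter.isPresent()) {
            return parameter.get();
        }
        return detectFirst(fromHeader, fromHtml, Optional.empty());
    }
}
```

With a header that says ISO-8859-1 and a parameter set to US-ASCII, detectFirst returns ISO-8859-1 while parameterFirst returns US-ASCII; without any input both end at UTF-8.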

WDYT?

Michael


Re: What to do if project.build.sourceEncoding is not provided?

Posted by Hervé BOUTEMY <he...@free.fr>.
On Friday, 14 November 2014 17:58:44, Michael Osipov wrote:
> On 2014-11-14 at 17:47, Hervé BOUTEMY wrote:
> > since it is the encoding of a downloaded license, it has nothing to do
> > with the encoding of project sources: using ${project.build.sourceEncoding}
> > is IMHO the wrong algorithm (which happens to give good results since a
> > lot of people use UTF-8)
> > 
> > then I'd go either for a parameter for the goal, or JSoup that does the
> > magic to detect effective content encoding
> 
> While this seems sound, what if the resource is plain text and no
> encoding can be deduced?
true: our only bet is the parameter

> 
> The parameter won't help if there are several licenses with several
> encodings used.
looks like the parameter can be either simple or complex: need a syntax

or just ignore: is it theory or reality?



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Michael Osipov <mi...@apache.org>.
On 2014-11-14 at 17:47, Hervé BOUTEMY wrote:
> since it is the encoding of a downloaded license, it has nothing to do with
> the encoding of project sources: using ${project.build.sourceEncoding} is IMHO
> the wrong algorithm (which happens to give good results since a lot of people
> use UTF-8)
>
> then I'd go either for a parameter for the goal, or JSoup that does the magic
> to detect effective content encoding

While this seems sound, what if the resource is plain text and no 
encoding can be deduced?

The parameter won't help if there are several licenses with several 
encodings used.



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Hervé BOUTEMY <he...@free.fr>.
since it is the encoding of a downloaded license, it has nothing to do with 
the encoding of project sources: using ${project.build.sourceEncoding} is IMHO 
the wrong algorithm (which happens to give good results since a lot of people 
use UTF-8)

then I'd go either for a parameter for the goal, or JSoup that does the magic 
to detect effective content encoding

Regards,

Hervé

On Friday, 14 November 2014 10:37:22, Michael Osipov wrote:
> On 2014-11-14 at 04:02, Kristian Rosenvold wrote:
> > Isn't this handled by the content-type headers normally?
> 
> No, for two reasons:
> 
> 1. The current code does not inspect the content type
> 2. The server does send text/html but not the encoding used, which is not
> necessary because it is declared within the file itself
> 
> The only option would be to inspect the content type header and make
> further assumptions.
> 
> Michael



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Michael Osipov <mi...@apache.org>.
On 2014-11-14 at 04:02, Kristian Rosenvold wrote:
> Isn't this handled by the content-type headers normally?

No, for two reasons:

1. The current code does not inspect the content type
2. The server does send text/html but not the encoding used, which is not 
necessary because it is declared within the file itself

The only option would be to inspect the content type header and make 
further assumptions.
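
Inspecting the header could look roughly like this. The Java below is a simplified sketch and not existing plugin code: the class and method names are invented, and the parsing handles only the common "type/subtype; charset=..." form.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper that extracts the charset parameter from a Content-Type value. */
final class ContentTypeCharset {

    /**
     * @param contentType raw header value such as "text/html; charset=UTF-8", may be null
     * @return the declared charset, or empty if none is declared or it is not supported
     */
    static Optional<Charset> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; everything after it is a parameter.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported charset name: treat as "not declared".
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

The header value itself is available from java.net.URLConnection.getContentType(). For the server described in point 2, which sends a bare text/html, this returns an empty result, so a further fallback is still needed.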

Michael



Re: What to do if project.build.sourceEncoding is not provided?

Posted by Kristian Rosenvold <kr...@gmail.com>.
Isn't this handled by the content-type headers normally?

Kristian

