You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Richard BUNEL <rb...@gmail.com> on 2022/05/10 07:48:30 UTC

[text] Hexadecimal characters in decimal numeric entities

Hi everyone,

We're having a debate (in the comment section of this PR
<https://github.com/apache/commons-text/pull/310>) on the legitimacy of
unescaping semicolon-less numerical character entities in Commons-Text.

The possibility to unescape such entities has long been part of the
library, via the semiColonOptional
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
option
in the NumericEntityUnescaper class.

While testing this option, I discovered a small bug which allows to bypass
the unescaper.
A string like this: *<iframe src="&#106avascript:alert(1)">* is ignored by
the unescaper, because even though this entity is a decimal one, the
algorithm searches for hexidecimal characters in all cases and includes the
"a" after the "6".
This prompted me to fix it in this commit
<https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999>
and
open the PR.

However, as mentioned earlier, there is a debate on the legitimacy of
unescaping semicolon-less from the beginning.

The point of garydgregory is that such entities do not form part of the
HTML specification and as such Commons-Text should not consider it.

My point and kinow's however, is that these semicolon-less entities are
unescaped by virtually every modern browsers (tested with Chrome, Firefox,
Edge and Safari) and that Commos-Text could reasonnably expect the library
to support them.

Also, I pointed that my fix only makes the unsecaping work correctly with
decimal entities, so in my opinion the PR shouldn't be blocked by the
debate.

What's your opinion about it ?

Thanks !
Richard

https://github.com/apache/commons-text/pull/310

Re: [text] Hexadecimal characters in decimal numeric entities

Posted by Richard BUNEL <rb...@gmail.com>.
Hi Bruno,

Yeah sorry for being impatient, hope I didn't bother anyone, it was not the
point.
I'm not familiar with how such projects are maintained, but happy to learn
more about it every day.
Thanks for the answer by the way.

Both solutions you mention seem right to me too.
I understand the need for a strict spec-following version to be the default
one.
However, I indeed support the semicolon-less alternative to be available
for developers wishing to use this option.

Let's take the time to discuss it ;-)

Best regards,
Richard



Le sam. 21 mai 2022 à 06:25, Bruno Kinoshita <ki...@apache.org> a écrit :

> Hi Richard,
>
> Thanks for the explanation and patience. Being a team of volunteers means
> that for some discussions and issues like this one we may take a while to
> find a solution/decision.
>
> I think the pull request/issue is valid since the code allows symbols
> without the semicolon to be escaped, but it is failing to do so with hex
> characters (which is what the PR is trying to fix), as Richard explained
> here and in the pull request.
>
> I would be +1 for deprecating the code and/or refactoring it too if others
> prefer it. Having a more strict version that follows a specification might
> be easier to maintain, or to have multiple solutions (one strict, another
> that doesn't require semicolon, etc), as was suggested in the PR review.
> Both sound like valid alternatives that are worth thinking about.
>
> However, I don't think we need to block that pull request until we have
> made a decision on that. I think we could fix this issue with no-semicolon
> and hex chars with the code from that pull request, and create a follow-up
> issue to discuss what to do about this code.
>
> Cheers
> -Bruno
>
> On Thu, 19 May 2022 at 04:19, Richard BUNEL <rb...@gmail.com> wrote:
>
> > Hello,
> >
> > Given the absence of answers, I thought I might add another argument to
> my
> > point.
> >
> > While I didn't mention it in my first post, please consider that the
> > semiColonOptional
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48
> > >
> > option
> > is never activated by default.
> > The constructor
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L79
> > >
> > of
> > the NumericEntityUnescaper class uses by default the DEFAULT_OPTION
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L57
> > >
> > which
> > is set to the semiColonRequired
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L43
> > >
> >  value.
> >
> > Then, the UNESCAPE_HTML4
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L440
> > >
> > translator
> > in the StringEscapeUtils class, which is in turn used by the
> unescapeHtml4
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L791
> > >
> > method,
> > indeed calls the aforementioned constructor.
> > All of this is consistent with the fact that, by default, the Commons
> Text
> > library requires semicolons to be used after numeric HTML entities, and
> as
> > such follows strictly the HTML specification.
> >
> > However, the semiColonOptional
> > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48
> > >
> > option
> > does exist and may be used in certain ways. Therefore, I think we can
> > expect it to work correctly if used with decimal entities.
> > That's the point of my commit, nothing more than that.
> > I hope this little explanation might help you make your opinion.
> >
> > Thanks in advance,
> > Richard
> >
> > Le mar. 10 mai 2022 à 09:48, Richard BUNEL <rb...@gmail.com> a écrit
> :
> >
> > > Hi everyone,
> > >
> > > We're having a debate (in the comment section of this PR
> > > <https://github.com/apache/commons-text/pull/310>) on the legitimacy
> of
> > > unescaping semicolon-less numerical character entities in Commons-Text.
> > >
> > > The possibility to unescape such entities has long been part of the
> > > library, via the semiColonOptional
> > > <
> >
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48
> >
> > option
> > > in the NumericEntityUnescaper class.
> > >
> > > While testing this option, I discovered a small bug which allows to
> > bypass
> > > the unescaper.
> > > A string like this: *<iframe src="&#106avascript:alert(1)">* is
> > > ignored by the unescaper, because even though this entity is a decimal
> > one,
> > > the algorithm searches for hexidecimal characters in all cases and
> > includes
> > > the "a" after the "6".
> > > This prompted me to fix it in this commit
> > > <
> >
> https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999
> >
> > and
> > > open the PR.
> > >
> > > However, as mentioned earlier, there is a debate on the legitimacy of
> > > unescaping semicolon-less from the beginning.
> > >
> > > The point of garydgregory is that such entities do not form part of the
> > > HTML specification and as such Commons-Text should not consider it.
> > >
> > > My point and kinow's however, is that these semicolon-less entities are
> > > unescaped by virtually every modern browsers (tested with Chrome,
> > Firefox,
> > > Edge and Safari) and that Commos-Text could reasonnably expect the
> > library
> > > to support them.
> > >
> > > Also, I pointed that my fix only makes the unsecaping work correctly
> with
> > > decimal entities, so in my opinion the PR shouldn't be blocked by the
> > > debate.
> > >
> > > What's your opinion about it ?
> > >
> > > Thanks !
> > > Richard
> > >
> > > https://github.com/apache/commons-text/pull/310
> > >
> >
>

Re: [text] Hexadecimal characters in decimal numeric entities

Posted by Bruno Kinoshita <ki...@apache.org>.
Hi Richard,

Thanks for the explanation and patience. Being a team of volunteers means
that for some discussions and issues like this one we may take a while to
find a solution/decision.

I think the pull request/issue is valid since the code allows symbols
without the semicolon to be escaped, but it is failing to do so with hex
characters (which is what the PR is trying to fix), as Richard explained
here and in the pull request.

I would be +1 for deprecating the code and/or refactoring it too if others
prefer it. Having a more strict version that follows a specification might
be easier to maintain, or to have multiple solutions (one strict, another
that doesn't require semicolon, etc), as was suggested in the PR review.
Both sound like valid alternatives that are worth thinking about.

However, I don't think we need to block that pull request until we have
made a decision on that. I think we could fix this issue with no-semicolon
and hex chars with the code from that pull request, and create a follow-up
issue to discuss what to do about this code.

Cheers
-Bruno

On Thu, 19 May 2022 at 04:19, Richard BUNEL <rb...@gmail.com> wrote:

> Hello,
>
> Given the absence of answers, I thought I might add another argument to my
> point.
>
> While I didn't mention it in my first post, please consider that the
> semiColonOptional
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48
> >
> option
> is never activated by default.
> The constructor
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L79
> >
> of
> the NumericEntityUnescaper class uses by default the DEFAULT_OPTION
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L57
> >
> which
> is set to the semiColonRequired
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L43
> >
>  value.
>
> Then, the UNESCAPE_HTML4
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L440
> >
> translator
> in the StringEscapeUtils class, which is in turn used by the unescapeHtml4
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L791
> >
> method,
> indeed calls the aforementioned constructor.
> All of this is consistent with the fact that, by default, the Commons Text
> library requires semicolons to be used after numeric HTML entities, and as
> such follows strictly the HTML specification.
>
> However, the semiColonOptional
> <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48
> >
> option
> does exist and may be used in certain ways. Therefore, I think we can
> expect it to work correctly if used with decimal entities.
> That's the point of my commit, nothing more than that.
> I hope this little explanation might help you make your opinion.
>
> Thanks in advance,
> Richard
>
> Le mar. 10 mai 2022 à 09:48, Richard BUNEL <rb...@gmail.com> a écrit :
>
> > Hi everyone,
> >
> > We're having a debate (in the comment section of this PR
> > <https://github.com/apache/commons-text/pull/310>) on the legitimacy of
> > unescaping semicolon-less numerical character entities in Commons-Text.
> >
> > The possibility to unescape such entities has long been part of the
> > library, via the semiColonOptional
> > <
> https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
> option
> > in the NumericEntityUnescaper class.
> >
> > While testing this option, I discovered a small bug which allows to
> bypass
> > the unescaper.
> > A string like this: *<iframe src="&#106avascript:alert(1)">* is
> > ignored by the unescaper, because even though this entity is a decimal
> one,
> > the algorithm searches for hexidecimal characters in all cases and
> includes
> > the "a" after the "6".
> > This prompted me to fix it in this commit
> > <
> https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999>
> and
> > open the PR.
> >
> > However, as mentioned earlier, there is a debate on the legitimacy of
> > unescaping semicolon-less from the beginning.
> >
> > The point of garydgregory is that such entities do not form part of the
> > HTML specification and as such Commons-Text should not consider it.
> >
> > My point and kinow's however, is that these semicolon-less entities are
> > unescaped by virtually every modern browsers (tested with Chrome,
> Firefox,
> > Edge and Safari) and that Commos-Text could reasonnably expect the
> library
> > to support them.
> >
> > Also, I pointed that my fix only makes the unsecaping work correctly with
> > decimal entities, so in my opinion the PR shouldn't be blocked by the
> > debate.
> >
> > What's your opinion about it ?
> >
> > Thanks !
> > Richard
> >
> > https://github.com/apache/commons-text/pull/310
> >
>

Re: [text] Hexadecimal characters in decimal numeric entities

Posted by Richard BUNEL <rb...@gmail.com>.
Hello,

Given the absence of answers, I thought I might add another argument to my
point.

While I didn't mention it in my first post, please consider that the
semiColonOptional
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
option
is never activated by default.
The constructor
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L79>
of
the NumericEntityUnescaper class uses by default the DEFAULT_OPTION
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L57>
which
is set to the semiColonRequired
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L43>
 value.

Then, the UNESCAPE_HTML4
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L440>
translator
in the StringEscapeUtils class, which is in turn used by the unescapeHtml4
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L791>
method,
indeed calls the aforementioned constructor.
All of this is consistent with the fact that, by default, the Commons Text
library requires semicolons to be used after numeric HTML entities, and as
such follows strictly the HTML specification.

However, the semiColonOptional
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
option
does exist and may be used in certain ways. Therefore, I think we can
expect it to work correctly if used with decimal entities.
That's the point of my commit, nothing more than that.
I hope this little explanation might help you make your opinion.

Thanks in advance,
Richard

Le mar. 10 mai 2022 à 09:48, Richard BUNEL <rb...@gmail.com> a écrit :

> Hi everyone,
>
> We're having a debate (in the comment section of this PR
> <https://github.com/apache/commons-text/pull/310>) on the legitimacy of
> unescaping semicolon-less numerical character entities in Commons-Text.
>
> The possibility to unescape such entities has long been part of the
> library, via the semiColonOptional
> <https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48> option
> in the NumericEntityUnescaper class.
>
> While testing this option, I discovered a small bug which allows to bypass
> the unescaper.
> A string like this: *<iframe src="&#106avascript:alert(1)">* is
> ignored by the unescaper, because even though this entity is a decimal one,
> the algorithm searches for hexidecimal characters in all cases and includes
> the "a" after the "6".
> This prompted me to fix it in this commit
> <https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999> and
> open the PR.
>
> However, as mentioned earlier, there is a debate on the legitimacy of
> unescaping semicolon-less from the beginning.
>
> The point of garydgregory is that such entities do not form part of the
> HTML specification and as such Commons-Text should not consider it.
>
> My point and kinow's however, is that these semicolon-less entities are
> unescaped by virtually every modern browsers (tested with Chrome, Firefox,
> Edge and Safari) and that Commos-Text could reasonnably expect the library
> to support them.
>
> Also, I pointed that my fix only makes the unsecaping work correctly with
> decimal entities, so in my opinion the PR shouldn't be blocked by the
> debate.
>
> What's your opinion about it ?
>
> Thanks !
> Richard
>
> https://github.com/apache/commons-text/pull/310
>