Posted to dev@manifoldcf.apache.org by Brad Dennis <Br...@directsupply.com> on 2015/06/24 15:45:28 UTC

Webconnector: Comparison operator '<' in the body of a script tag

Hi,

There appears to be a bug in the TagParseState when the comparison operator '<' is encountered in the body of a script tag.  It appears to get flagged as an open tag, and then the next '</' closes it.  In my case, the next '</' belongs to the closing script tag.  The ScriptParseState then chomps everything until it encounters a second </script> tag.

A live link that demonstrates this bug is here:
http://www.prnewswire.com/search-results/news/Google%252C%2520Inc.-30-days-page-1-pagesize-20

The '<' near line 2826, in the script body that begins near line 2759, begins a new tag 'arraykeywords.length', which gets closed by the '</' in the closing script tag.  The ScriptParseState chomps all the HTML until it sees the end script tag near line 3385.

At the moment, I'm not sure of a solution other than pushing the script tag handling up to the TagParseState and treating it the way CDATA is treated.
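
To make the CDATA-like idea concrete, here is a minimal sketch in Java.  It is not ManifoldCF code, and the names RawScriptSkipper and findScriptEnd are hypothetical.  It assumes the parser already knows where the script body starts and simply treats everything up to the first case-insensitive "</script" as opaque text, so a '<' comparison inside the script never reaches the tag logic.

```java
/** Hypothetical helper, not part of ManifoldCF: treats a script body as opaque text. */
final class RawScriptSkipper {
  private static final String END = "</script";

  /**
   * Returns the index of the first "</script" (case-insensitive) at or after
   * bodyStart that is followed by whitespace, '/' or '>', or -1 if there is none.
   */
  static int findScriptEnd(String html, int bodyStart) {
    // The loop bound guarantees that the character after the match can be read.
    for (int i = bodyStart; i + END.length() < html.length(); i++) {
      if (html.regionMatches(true, i, END, 0, END.length())) {
        char next = html.charAt(i + END.length());
        if (next == '>' || next == '/' || Character.isWhitespace(next)) {
          return i;
        }
      }
    }
    return -1;
  }
}
```

Under this treatment the block ends at the first "</script", even when that text sits inside a JavaScript string literal, so it trades one kind of surprise for another.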


Thanks,

Brad Dennis



RE: Webconnector: Comparison operator '<' in the body of a script tag

Posted by Brad Dennis <Br...@directsupply.com>.
Karl,

The patch is working.  Thank you very much!  Also, thank you for your clarification on the behavior of the parser.  It's pretty complex.

Brad

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com] 
Sent: Wednesday, June 24, 2015 10:04 AM
To: dev
Subject: Re: Webconnector: Comparison operator '<' in the body of a script tag

Hi Brad,

I've attached a patch to the ticket:
https://issues.apache.org/jira/browse/CONNECTORS-1215 .  This patch merely tightens what the fuzzyml parser regards as a valid tag start, to adhere to the W3C specification.  I don't know whether browsers do it that way or not, but it should fix the specific page you included in your post.

Please let me know if you run into further difficulties with other pages; we can look at them one at a time.

Karl



Re: Webconnector: Comparison operator '<' in the body of a script tag

Posted by Karl Wright <da...@gmail.com>.
"Unfortunately, I don't have any suggestions other than using a stack to
push open tags onto and pop off when an end tag is seen."  That's in fact
exactly what the current parser does, which is why it's actually
functioning mostly correctly AFAICT.
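
As an illustration of the mechanism under discussion, the following is a deliberately simplified sketch, not the fuzzyml implementation.  It assumes, going only by Brad's description, that an end tag closes whatever open tag is on top of the stack; LenientTagStack and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of stack-based tag matching; not the fuzzyml code. */
final class LenientTagStack {
  private final Deque<String> open = new ArrayDeque<>();

  /** Records a newly opened tag as the innermost one. */
  void startTag(String name) {
    open.push(name);
  }

  /**
   * Assumption: an end tag closes the innermost open tag regardless of its name.
   * Returns the tag that was closed, or null if nothing was open.
   */
  String endTag() {
    return open.poll();
  }

  /** True if a tag with this name is still open somewhere on the stack. */
  boolean isOpen(String name) {
    return open.contains(name);
  }
}
```

Calling startTag("script"), then startTag("arraykeywords.length"), then endTag() closes the spurious tag and leaves "script" open, which matches the report that a second </script> is needed before the script block ends.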

Karl


On Wed, Jun 24, 2015 at 11:03 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Brad,
>
> I've attached a patch to the ticket:
> https://issues.apache.org/jira/browse/CONNECTORS-1215 .  This patch
> merely tightens what the fuzzyml parser regards as a valid tag start, to
> adhere to the w3c specification.  I don't know whether browsers do it that
> way or not, but it should fix the specific page you included in your post.
>
> Please let me know if you run into further difficulties with other pages;
> we can look at them one at a time.
>
> Karl

Re: Webconnector: Comparison operator '<' in the body of a script tag

Posted by Karl Wright <da...@gmail.com>.
Hi Brad,

I've attached a patch to the ticket:
https://issues.apache.org/jira/browse/CONNECTORS-1215 .  This patch merely
tightens what the fuzzyml parser regards as a valid tag start, to adhere to
the w3c specification.  I don't know whether browsers do it that way or
not, but it should fix the specific page you included in your post.

Please let me know if you run into further difficulties with other pages;
we can look at them one at a time.
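
For readers following along, the kind of tightening described above can be sketched as follows.  This is not the patch attached to CONNECTORS-1215.  It is a hypothetical check, loosely modeled on the HTML tokenizer rule that '<' opens markup only when followed by an ASCII letter, '!', '?', or a '/' that is itself followed by a letter.  The names TagStartCheck and looksLikeTagStart are invented for illustration.

```java
/** Hypothetical check, not the CONNECTORS-1215 patch. */
final class TagStartCheck {
  /** True if the '<' at position pos plausibly begins markup rather than plain text. */
  static boolean looksLikeTagStart(CharSequence s, int pos) {
    if (pos < 0 || pos + 1 >= s.length() || s.charAt(pos) != '<') {
      return false;
    }
    char next = s.charAt(pos + 1);
    if (isAsciiLetter(next) || next == '!' || next == '?') {
      return true;
    }
    // An end tag needs a letter right after "</".
    return next == '/' && pos + 2 < s.length() && isAsciiLetter(s.charAt(pos + 2));
  }

  private static boolean isAsciiLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  }
}
```

With this rule, the '<' in "i < n" is treated as text because a space follows it.  A comparison written without spaces, such as "a<b", would still look like a tag start, so a check of this kind narrows the problem rather than removing it.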

Karl


On Wed, Jun 24, 2015 at 10:49 AM, Brad Dennis <Br...@directsupply.com>
wrote:

> Karl,
>
> Thank you for investigating the issue.  My concern is that I expect it's
> fairly common to use '<' in embedded, uncommented JavaScript, and this bug
> excludes any content that appears after one and before a second end script
> tag from being crawled with ManifoldCF.  Unfortunately, I don't have any
> suggestions other than using a stack to push open tags onto and pop off
> when an end tag is seen.  I believe that would satisfy your example, but
> who knows what other problems a stack brings.
>
> Do you have any suggestions for workarounds I could implement locally?
>
> Thanks,
> Brad

RE: Webconnector: Comparison operator '<' in the body of a script tag

Posted by Brad Dennis <Br...@directsupply.com>.
Karl,

Thank you for investigating the issue.  My concern is that I expect it's fairly common to use '<' in embedded, uncommented JavaScript, and this bug prevents any content that appears after such a '<' and before a second end script tag from being crawled with ManifoldCF.  Unfortunately, I don't have any suggestions other than using a stack to push open tags onto and pop off when an end tag is seen.  I believe that would satisfy your example, but who knows what other problems a stack brings.

Do you have any suggestions for workarounds I could implement locally?

Thanks,
Brad

-----Original Message-----
From: Karl Wright [mailto:daddywri@gmail.com] 
Sent: Wednesday, June 24, 2015 9:33 AM
To: dev
Subject: Re: Webconnector: Comparison operator '<' in the body of a script tag

Brad,

The issue is complex because according to spec the code is doing the right thing.  Typically, <script> blocks look something like this:

<script ...>
<!--

...

//-->
</script>

The reason for the comment area is that, without it, tags within the script block are supposed to be recognized as such, even if they are ignored.  Within comments, this does not happen, of course, which is why comments are used.

I don't believe it is a real standard, but some browsers try to interpret script blocks differently even when no comment is given.  We can try to emulate that behavior but it is likely that our emulation will not work for all web pages, since it's not a standard.  Exploring how this works on various browsers would be the first step.  Specifically, if you do something like this:

<script ...>

foo = "<script></script>";
bar = "hello";

</script>

... what happens?  Does the script end at the first </script>, or the second?  And, in what browsers?

Until we get more clarity, it's going to be hard to build a feature that actually helps rather than hurts...

Karl



Re: Webconnector: Comparison operator '<' in the body of a script tag

Posted by Karl Wright <da...@gmail.com>.
Brad,

The issue is complex because according to spec the code is doing the right
thing.  Typically, <script> blocks look something like this:

<script ...>
<!--

...

//-->
</script>

The reason for the comment area is that, without it, tags within the
script block are supposed to be recognized as such, even if they are
ignored.  Within comments, this does not happen, of course, which is why
comments are used.

I don't believe it is a real standard, but some browsers try to interpret
script blocks differently even when no comment is given.  We can try to
emulate that behavior but it is likely that our emulation will not work for
all web pages, since it's not a standard.  Exploring how this works on
various browsers would be the first step.  Specifically, if you do
something like this:

<script ...>

foo = "<script></script>";
bar = "hello";

</script>

... what happens?  Does the script end at the first </script>, or the
second?  And, in what browsers?

Until we get more clarity, it's going to be hard to build a feature that
actually helps rather than hurts...

Karl


On Wed, Jun 24, 2015 at 10:05 AM, Karl Wright <da...@gmail.com> wrote:

> Hi Brad,
>
> I've created a ticket: CONNECTORS-1215.  Looking into this now.
>
> Karl
>

Re: Webconnector: Comparison operator '<' in the body of a script tag

Posted by Karl Wright <da...@gmail.com>.
Hi Brad,

I've created a ticket: CONNECTORS-1215.  Looking into this now.

Karl


On Wed, Jun 24, 2015 at 9:45 AM, Brad Dennis <Br...@directsupply.com>
wrote:

> Hi,
>
> There appears to be a bug in the TagParseState when the comparison
> operator '<' is encountered in the body of a script tag.  It appears to
> get flagged as an open tag and then the next '</' closes it.  In my case,
> the next '</' is the script tag.  The ScriptParseState chomps everything
> until it encounters a second </script> tag.
>
> A live link that demonstrates this bug is here:
>
> http://www.prnewswire.com/search-results/news/Google%252C%2520Inc.-30-days-page-1-pagesize-20
>
> The '<' near line 2826 in the script body that begins near line 2759
> begins a new tag 'arraykeywords.length' which gets closed by the '</' in
> the closing script tag.  The ScriptParseState chomps all the html until it
> sees the end script tag near line 3385.
>
> At the moment, I'm not sure of a solution other than pushing the script
> tag handling up to the TagParseState and treating it like CDATA is.
>
>
> Thanks,
>
> Brad Dennis
>
>
>