Posted to user@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2018/11/15 11:57:09 UTC

Re: web connector : links extraction issues

Hi Olivier,

You can create a ticket but I don't have a good solution for you in any
case.

Karl


On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <
olivier.tavard@francelabs.com> wrote:

> Hi Karl,
>
> Do you think I should create a Jira issue for this bug, i.e. that link
> extraction does not work when the code inside script tags contains special
> characters like '>' or '<'?
>
> Thanks,
> Best regards,
>
> Olivier
>
>
>
> On Oct 30, 2018, at 12:05, Olivier Tavard <ol...@francelabs.com> wrote:
>
> Hi Karl,
>
> Thanks for your answer.
> I kept looking into this and found the problem: the Javascript code
> inside the <script></script> tags contained the character '<'. In that
> case, link extraction does not work with the web connector.
>
> To reproduce it, I created this page, hosted it on a local Apache server,
> and indexed it with MCF 2.11 out of the box.
>
> In the first example, the page was:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript"></script>
>
> </head>
> <body>
>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
>
> Link extraction was correct, as shown in the debug log:
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for '/testjs/test.html'
>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document 'http://localhost:8888/testjs/test.html', found link to 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> —
> In the second example, the code was almost the same, except that I
> included the character '<' in the content of the script tag:
> <!DOCTYPE html>
>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">a<b</script>
>
> </head>
> <body>
>
>     <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>
> </body>
>
> Link extraction was not successful; the debug log shows no link found:
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For http://localhost:8888/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for '/testjs/test.html'
>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|http://localhost:8888/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 'http://localhost:8888/testjs/test.html' is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to ingest 'http://localhost:8888/testjs/test.html'
> —
> So special characters like the less-than sign should be escaped in the
> web connector code to preserve link extraction.
>
> Thanks,
> Best regards,
>
>
> Olivier
>
> On Oct 29, 2018, at 19:39, Karl Wright <da...@gmail.com> wrote:
>
> Hi Olivier,
>
> Javascript inclusion in the Web Connector is not evaluated.  In fact, no
> Javascript is executed at all.  Therefore it should not matter what is
> included via javascript.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <
> olivier.tavard@francelabs.com> wrote:
>
>> Hi,
>>
>> Regarding the web connector, I noticed that on some websites, certain
>> Javascript code can prevent the web connector from correctly fetching all
>> the links present on the page. Specifically, this affects websites that
>> include a deprecated version of the New Relic web agent, such as
>> js-agent.newrelic.com/nr-1071.min.js.
>> After downloading the page locally and removing the reference to the New
>> Relic agent, the web connector fetched the links on the page correctly.
>> So it seems that the Javascript injected by the New Relic agent was the
>> reason the links were not fetched.
>> This case is rare and concerns only old versions of the New Relic agent.
>> But more generally, would it be possible to block Javascript injection at
>> the connector level during indexing?
>>
>> Thanks,
>> Best regards,
>> Olivier
>>
>>
>>
>
>

Re: web connector : links extraction issues

Posted by Karl Wright <da...@gmail.com>.
Hi Olivier,

The HTML parser built into MCF is quite resilient against badly formed
HTML, but there are limits.  Characters like "<" and ">" are used to denote
tags and thus they confuse the parser when they are present in unescaped
form.  It may be possible to handle some such cases, but generally not
without a great deal of work (and also knowledge that we're parsing HTML
specifically, not general XML).
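
To illustrate the kind of confusion described above, here is a minimal
Python sketch of a toy tag scanner. It is not MCF's parser (which is written
in Java and whose internals are not shown in this thread); the function name
and the scanning strategy are assumptions made purely for illustration. The
scanner treats every "<" as the start of a tag, even inside a script element,
so the text "<b</script>" is read as one unknown tag and the end of the
script is never seen.

```python
import re

# The two test pages from this thread; they differ only in the script body.
PAGE_OK = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript"></script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""
PAGE_WITH_LT = PAGE_OK.replace("></script>", ">a<b</script>")


def extract_links_naive(html):
    """Hypothetical toy scanner: every '<' opens a tag that runs to the next '>'."""
    links = []
    in_script = False
    pos = 0
    while True:
        start = html.find("<", pos)
        if start == -1:
            break
        end = html.find(">", start)
        if end == -1:
            break
        tag = html[start + 1:end].strip()
        parts = tag.split()
        name = parts[0].lower() if parts else ""
        if in_script:
            # Only an exact "/script" tag leaves script mode.
            if name == "/script":
                in_script = False
        elif name == "script":
            in_script = True
        elif name == "a":
            match = re.search(r'href="([^"]*)"', tag)
            if match:
                links.append(match.group(1))
        pos = end + 1
    return links


print(extract_links_naive(PAGE_OK))
print(extract_links_naive(PAGE_WITH_LT))
```

With the empty script the toy scanner reports the link. With "a<b" in the
script it stays in script mode until the end of the document and reports
nothing, which matches the symptom in the debug logs, although the real
cause inside MCF may differ.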

So, in general, I think you should not expect ManifoldCF to be able to
handle whatever badly formed HTML you throw at it.  It's never going to be
as resilient as (say) Firefox in this regard.  It is much better to format
HTML properly in the first place.  You can verify this by using one of the
many online XML validator tools available.
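
As a local alternative to an online validator, the following Python sketch
checks whether the links on a page survive HTML-aware parsing. It uses the
standard library's html.parser module, which treats the content of script
elements as raw text; it is only a diagnostic aid and says nothing about how
MCF itself parses. The class and function names are made up for this example.
Note that an XML validator would flag a bare "<" inside a script, whereas an
HTML parser that special-cases script elements reads it as script text.

```python
from html.parser import HTMLParser

# The two test pages from this thread; they differ only in the script body.
PAGE_OK = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript"></script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""
PAGE_WITH_LT = PAGE_OK.replace("></script>", ">a<b</script>")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(extract_links(PAGE_OK))
print(extract_links(PAGE_WITH_LT))
```

If a page yields its links here but not in the crawl, the difference lies in
how the crawler's parser treats script content rather than in the links
themselves.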

Thanks,
Karl


On Thu, Nov 15, 2018 at 7:22 AM Olivier Tavard <
olivier.tavard@francelabs.com> wrote:

> Hi Karl,
>
> Thanks for your answer.
> Could you elaborate, please? Just so I understand: do you mean there is
> no chance that special characters could be escaped in the MCF code in this
> case, i.e. the website itself needs to escape the special characters,
> otherwise extraction will not work in MCF?
>
> Best regards,
>
> Olivier

Re: web connector : links extraction issues

Posted by Olivier Tavard <ol...@francelabs.com>.
Hi Karl,

Thanks for your answer. 
Could you elaborate, please? Just so I understand: do you mean there is no chance that special characters could be escaped in the MCF code in this case, i.e. the website itself needs to escape the special characters, otherwise extraction will not work in MCF?

Best regards,

Olivier
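
For reference, here is a rough Python sketch of what handling this on the
crawler side could look like: since no Javascript is evaluated during the
crawl anyway, the body of each script element can be blanked before the page
is handed to a link extractor. This is a hypothetical preprocessing step, not
an existing MCF feature or option, and the function name is an assumption.

```python
import re

# Second test page from this thread (script body contains '<').
PAGE_WITH_LT = """<!DOCTYPE html>
<head>
<title>test</title>
<meta charset="utf-8" />
<script type="text/javascript">a<b</script>
</head>
<body>
<a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
</body>
"""

# Matches an opening script tag, its body, and the closing tag.
# Limitation: an attribute value containing '>' would end the opening tag early.
SCRIPT_BODY = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def blank_script_bodies(html):
    """Return html with the content of every script element removed."""
    return SCRIPT_BODY.sub(r"\1\3", html)


cleaned = blank_script_bodies(PAGE_WITH_LT)
print(cleaned)
```

After this step the second test page is identical to the first one, for
which link extraction worked. A regular expression is a blunt tool for HTML,
so this is only a sketch of the idea.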


