
Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/10/17 15:47:22 UTC

How does nutch handles javaScript in href

Hello List,

perhaps someone can give me a quick answer to this question:

Some pages have "<a>" tags with JavaScript content in the href, e.g.:

<a 
href="javascript:linkTo_UnCryptMailto('nbjmup+ijxj.tuvecfsAvoj.lbttfm/ef');" 
class="mail">

How will this href be handled?

Thank you

Re: FOUND IT - How does nutch handles javaScript in href

Posted by Markus Jelsma <ma...@openindex.io>.

Not sure what JsParse is supposed to do in this situation but you should not 
use it anyway. It's not regarded as stable, just the protocolhttp.


Re: FOUND IT - How does nutch handles javaScript in href

Posted by Marek Bachmann <m....@uni-kassel.de>.

OK, I went through the source, step by step.

It is the HtmlParseFilter called JSParseFilter. So it seems I have to 
exclude it from the plugin list.
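
A rough sketch of what "excluding it from the plugin list" amounts to, assuming the usual setup in which Nutch activates every plugin whose id fully matches the regular expression in the plugin.includes property, and that JSParseFilter ships in the parse-js plugin. Both pattern values below are illustrative assumptions, not the configuration used in this thread, and the check only mimics the id matching with java.util.regex rather than calling any Nutch code.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Returns true if the plugin id fully matches the include pattern,
    // which is how this sketch assumes plugin.includes is evaluated.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical plugin.includes values, shortened for illustration.
        String withJs = "protocol-http|urlfilter-regex|parse-(html|tika|js)|index-(basic|anchor)";
        String withoutJs = "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)";

        System.out.println("parse-js with js listed:    " + isIncluded(withJs, "parse-js"));
        System.out.println("parse-js with js removed:   " + isIncluded(withoutJs, "parse-js"));
        System.out.println("parse-html with js removed: " + isIncluded(withoutJs, "parse-html"));
    }
}
```

Dropping "js" from the alternation leaves the HTML parser active while the JavaScript filter is no longer loaded. The real value belongs in nutch-site.xml and should be checked against the local installation.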

2011-10-19 17:33:46,031 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php'
2011-10-19 17:33:46,041 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef'
2011-10-19 17:33:46,042 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef'

But its behaviour isn't right anyway, is it? It shouldn't treat this 
encrypted string as an outlink.

On 19.10.2011 17:13, Markus Jelsma wrote:
> Tika can do things a bit different. At least it did in the past and it seems
> this is the case as well, i get 20 outlinks with Tika.
>
>> One interesting thing I found out:
>>
>> The HtmlParser Class tells me in debug mode (I had to replace the
>> LOG.trace states through LOG.debug, since I don't know how to use these
>> trace thing) that it had found 20 outlinks:
>>
>> 2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in
>> http://www.uni-kassel.de/intranet/footernavi/redaktion.html
>>
>> BUT the result of ParserChecker tells me there were 23 outlinks:
>>
>> (...)
>> Status: success(1,0)
>> Title: Intranet: Redaktion
>> Outlinks: 23
>>     outlink: toUrl:
>> http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php
>> anchor:
>>     outlink: toUrl:
>> http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef
>> anchor:
>>     outlink: toUrl:
>> http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef
>> anchor:
>> (...)
>>
>> This first three links are the ones which shouldn't be there. And the
>> count is the difference between the output if ParserChecker and the
>> debug log.
>>
>> Seems these links doesn't come to the list through HtmlParser?
>>
>> On 19.10.2011 16:24, Marek Bachmann wrote:
>>> On 19.10.2011 16:00, lewis.mcgibbney@gmail.com wrote:
>>>> Then in my own opinion there is no existing code within parse-html which
>>>> prevents it from parsing the anchor snippts you've posted.
>>>
>>> But something is happening with the content of the href attribute, since
>>> in the source file its value is:
>>>
>>> <a
>>> href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');"
>>> class="mail">
>>>
>>> and after the parse it is just "nbjmup+jousbofuAvoj.lbttfm/ef" that
>>> means, that the href value is handled somehow?!
>>>
>>> I guess if nothing would be done with the href value then the outlink
>>> value should be:
>>>
>>> http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
>>>
>>>
>>> Perhaps the java script gets evaluated somewhere but it fails because
>>> the reference isn't found...
>>>
>>> I'll look in the html parser to found more details.
>>>
>>>> This would make a great addition to the parse-html as it seems to be an
>>>> unforseen boundary case that we should not ignore.
>>>>
>>>> If you don't get feedback on this, can I ask for you to open a JIRA
>>>> ticket based upon your understanding of the situation?
>>>>
>>>> Thank you

Re: How does nutch handles javaScript in href

Posted by Markus Jelsma <ma...@openindex.io>.

Tika can do things a bit differently. At least it did in the past, and that 
seems to be the case here as well: I get 20 outlinks with Tika.

> One interesting thing I found out:
> 
> The HtmlParser Class tells me in debug mode (I had to replace the
> LOG.trace states through LOG.debug, since I don't know how to use these
> trace thing) that it had found 20 outlinks:
> 
> 2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in
> http://www.uni-kassel.de/intranet/footernavi/redaktion.html
> 
> BUT the result of ParserChecker tells me there were 23 outlinks:
> 
> (...)
> Status: success(1,0)
> Title: Intranet: Redaktion
> Outlinks: 23
>    outlink: toUrl:
> http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php
> anchor:
>    outlink: toUrl:
> http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef
> anchor:
>    outlink: toUrl:
> http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef
> anchor:
> (...)
> 
> This first three links are the ones which shouldn't be there. And the
> count is the difference between the output if ParserChecker and the
> debug log.
> 
> Seems these links doesn't come to the list through HtmlParser?
> 
> On 19.10.2011 16:24, Marek Bachmann wrote:
> > On 19.10.2011 16:00, lewis.mcgibbney@gmail.com wrote:
> >> Then in my own opinion there is no existing code within parse-html which
> >> prevents it from parsing the anchor snippts you've posted.
> > 
> > But something is happening with the content of the href attribute, since
> > in the source file its value is:
> > 
> > <a
> > href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');"
> > class="mail">
> > 
> > and after the parse it is just "nbjmup+jousbofuAvoj.lbttfm/ef" that
> > means, that the href value is handled somehow?!
> > 
> > I guess if nothing would be done with the href value then the outlink
> > value should be:
> > 
> > http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
> > 
> > 
> > Perhaps the java script gets evaluated somewhere but it fails because
> > the reference isn't found...
> > 
> > I'll look in the html parser to found more details.
> > 
> >> This would make a great addition to the parse-html as it seems to be an
> >> unforseen boundary case that we should not ignore.
> >> 
> >> If you don't get feedback on this, can I ask for you to open a JIRA
> >> ticket based upon your understanding of the situation?
> >> 
> >> Thank you

Re: How does nutch handles javaScript in href

Posted by Marek Bachmann <m....@uni-kassel.de>.

One interesting thing I found out:

The HtmlParser class tells me in debug mode (I had to replace the 
LOG.trace statements with LOG.debug, since I don't know how to enable 
trace logging) that it found 20 outlinks:

2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html

BUT the result of ParserChecker tells me there were 23 outlinks:

(...)
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php 
anchor:
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef 
anchor:
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef 
anchor:
(...)

These first three links are the ones that shouldn't be there, and their 
count is exactly the difference between the ParserChecker output and the 
debug log.

So it seems these links don't get into the list through HtmlParser?

On 19.10.2011 16:24, Marek Bachmann wrote:
> On 19.10.2011 16:00, lewis.mcgibbney@gmail.com wrote:
>> Then in my own opinion there is no existing code within parse-html which
>> prevents it from parsing the anchor snippts you've posted.
>
> But something is happening with the content of the href attribute, since
> in the source file its value is:
>
> <a
> href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');"
> class="mail">
>
> and after the parse it is just "nbjmup+jousbofuAvoj.lbttfm/ef" that
> means, that the href value is handled somehow?!
>
> I guess if nothing would be done with the href value then the outlink
> value should be:
>
> http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
>
>
> Perhaps the java script gets evaluated somewhere but it fails because
> the reference isn't found...
>
> I'll look in the html parser to found more details.
>
>>
>> This would make a great addition to the parse-html as it seems to be an
>> unforseen boundary case that we should not ignore.
>>
>> If you don't get feedback on this, can I ask for you to open a JIRA
>> ticket based upon your understanding of the situation?
>>
>> Thank you
>>
>

Re: How does nutch handles javaScript in href

Posted by Marek Bachmann <m....@uni-kassel.de>.

On 19.10.2011 16:00, lewis.mcgibbney@gmail.com wrote:
> Then in my own opinion there is no existing code within parse-html which
> prevents it from parsing the anchor snippts you've posted.

But something is happening with the content of the href attribute, since 
in the source file its value is:

<a 
href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');" 
class="mail">

and after parsing it is just "nbjmup+jousbofuAvoj.lbttfm/ef". That 
means the href value is being processed somehow.

I guess that if nothing were done with the href value, the outlink 
value would be:

http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');

Perhaps the JavaScript gets evaluated somewhere, but it fails because 
the reference isn't found...

I'll look in the HTML parser to find more details.
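
As a way to test the guess above, here is a minimal sketch that resolves both strings against the page URL with java.net.URI. This is an assumption about generic URI resolution only; it is not Nutch's own URL handling, and the class and method names are made up for illustration.

```java
import java.net.URI;

public class HrefResolution {
    // Resolves an href against the page URL using generic URI rules.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.uni-kassel.de/intranet/footernavi/redaktion.html";

        // The full href has a scheme ("javascript"), so it counts as absolute
        // and is returned unchanged instead of being appended to the page path.
        System.out.println(resolve(page,
            "javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');"));

        // The bare function argument has no scheme, so it is treated as a
        // relative path and resolved against the page's directory.
        System.out.println(resolve(page, "nbjmup+jousbofuAvoj.lbttfm/ef"));
    }
}
```

Under these rules the untouched href would stay a javascript: URI, while the bare argument resolves to exactly the outlink reported by ParserChecker. That is consistent with something extracting the quoted argument before resolution.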

>
> This would make a great addition to the parse-html as it seems to be an
> unforseen boundary case that we should not ignore.
>
> If you don't get feedback on this, can I ask for you to open a JIRA
> ticket based upon your understanding of the situation?
>
> Thank you
>

Re: Re: How does nutch handles javaScript in href

Posted by le...@gmail.com.

Then in my opinion there is no existing code within parse-html which 
prevents it from parsing the anchor snippets you've posted.

This would make a great addition to parse-html, as it seems to be an 
unforeseen boundary case that we should not ignore.

If you don't get feedback on this, can I ask for you to open a JIRA ticket  
based upon your understanding of the situation?

Thank you


Re: How does nutch handles javaScript in href

Posted by Marek Bachmann <m....@uni-kassel.de>.

On 19.10.2011 14:34, lewis john mcgibbney wrote:
> Hi Marek,
>
> This is v. interesting and I am looking forward to hearing from anyone with
> similar problems. Unfortunately I've not experienced this behaviour, however
> it is clearly a significant problem as you point out. Ultimately it should
> be ironed out.
>
> What a great tool the ParserChecker is.
>
>> 11/10/19 13:58:05 INFO parse.ParserChecker: parsing:
>> http://www.uni-kassel.de/intranet/footernavi/redaktion.html
>> 11/10/19 13:58:05 INFO parse.ParserChecker: contentType:
>> application/xhtml+xml
>> 11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml
>> at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml
>> 11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin:
>> org.apache.nutch.parse.html.HtmlParser mapped to contentType
>> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
>> not claim to support contentType: application/xhtml+xml
>>
>
> This indicates that parse-html was not used and the default for wildcard
> contentType defaults to parse-tika... am I correct here?

According to my parse-plugins.xml, yes:

   <!--  by default if the mimeType is set to *, or
         if it can't be determined, use parse-tika -->
	<mimeType name="*">
	  <plugin id="parse-tika" />
	</mimeType>

BUT:

I added LOG.info("This is HtmlParser"); to the first line in getParse in 
HtmlParser.java and compiled it. After that I got:

(...)
11/10/19 15:20:08 WARN parse.ParserFactory: ParserFactory:Plugin: 
org.apache.nutch.parse.html.HtmlParser mapped to contentType 
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file 
does not claim to support contentType: application/xhtml+xml

11/10/19 15:20:08 INFO parse.html: This is HtmlParser

---------
Url
---------------
http://www.uni-kassel.de/intranet/footernavi/redaktion.html---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php 
anchor:
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef 
anchor:
(...)

As I understand this, the HtmlParser IS used and NOT Tika?



> If this is the case then it means that parse-tika is not dealing with the
> problem as you describe it. However I must also comment, that we recently
> committed Ferdy's NUTCH-1097 for trunk-1.4 which meant that parse-html dealt
> with application/xhtml+xml material. It would be interesting to see if
> parse-html in trunk-1.4 deals with this now. If not then I think this needs
> to be filed as a JIRA issue and dealt with appropriately.
>
> Can you please check and get back to us...
>
> Thanks
>
> Lewis
>

Re: How does nutch handles javaScript in href

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Marek,

This is v. interesting and I am looking forward to hearing from anyone with
similar problems. Unfortunately I've not experienced this behaviour, however
it is clearly a significant problem as you point out. Ultimately it should
be ironed out.

What a great tool the ParserChecker is.

> 11/10/19 13:58:05 INFO parse.ParserChecker: parsing:
> http://www.uni-kassel.de/intranet/footernavi/redaktion.html
> 11/10/19 13:58:05 INFO parse.ParserChecker: contentType:
> application/xhtml+xml
> 11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml
> 11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
> not claim to support contentType: application/xhtml+xml
>

This indicates that parse-html was not used and the default for wildcard
contentType defaults to parse-tika... am I correct here?

If this is the case then it means that parse-tika is not dealing with the
problem as you describe it. However I must also comment, that we recently
committed Ferdy's NUTCH-1097 for trunk-1.4 which meant that parse-html dealt
with application/xhtml+xml material. It would be interesting to see if
parse-html in trunk-1.4 deals with this now. If not then I think this needs
to be filed as a JIRA issue and dealt with appropriately.

Can you please check and get back to us...

Thanks

Lewis

Re: How does nutch handles javaScript in href

Posted by Marek Bachmann <m....@uni-kassel.de>.

So, I figured out that they are not discarded.

Let's take this URL for example:

http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef

This page is not found. I used the linkdb to determine why this dead link 
is in the crawldb. The result:

./nutch readlinkdb linkdb -url 
"http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef"
11/10/19 01:29:52 INFO util.NativeCodeLoader: Loaded the native-hadoop 
library
11/10/19 01:29:52 INFO zlib.ZlibFactory: Successfully loaded & 
initialized native-zlib library
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
fromUrl: http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor:
fromUrl: http://www.uni-kassel.de/intranet/footernavi/bildnachweis.html 
anchor:
fromUrl: http://www.uni-kassel.de/intranet/footernavi/sitemap.html anchor:

I took the first page 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html and ran 
ParserChecker on it. This is the result:

./nutch org.apache.nutch.parse.ParserChecker 
"http://www.uni-kassel.de/intranet/footernavi/redaktion.html"
11/10/19 13:58:02 INFO parse.ParserChecker: fetching: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html
11/10/19 13:58:02 WARN plugin.PluginRepository: Plugins: directory not 
found: ${job.local.dir}/../jars/plugins
11/10/19 13:58:02 INFO plugin.PluginRepository: Plugins: looking in: 
/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/plugins
(...)
11/10/19 13:58:02 INFO http.Http: http.proxy.host = null
11/10/19 13:58:02 INFO http.Http: http.proxy.port = 8080
11/10/19 13:58:02 INFO http.Http: http.timeout = 10000
11/10/19 13:58:02 INFO http.Http: http.content.limit = 10485760
11/10/19 13:58:02 INFO http.Http: http.agent = Uni Kassel 
Spider/Nutch-1.3 (Test Crawler des ITS der Uni Kassel)
11/10/19 13:58:02 INFO http.Http: http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3
11/10/19 13:58:05 INFO conf.Configuration: found resource 
tika-mimetypes.xml at 
file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/tika-mimetypes.xml
11/10/19 13:58:05 INFO parse.ParserChecker: parsing: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html
11/10/19 13:58:05 INFO parse.ParserChecker: contentType: 
application/xhtml+xml
11/10/19 13:58:05 INFO conf.Configuration: found resource 
parse-plugins.xml at 
file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml
11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin: 
org.apache.nutch.parse.html.HtmlParser mapped to contentType 
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file 
does not claim to support contentType: application/xhtml+xml
---------
Url
---------------
http://www.uni-kassel.de/intranet/footernavi/redaktion.html---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php 
anchor:
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef 
anchor:
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef 
anchor:
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html#nav anchor: 
Skip to navigation (Press Enter).
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html#col3 anchor: 
Skip to main content (Press Enter).
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/metanavi/zur-uni-startseite.html 
anchor: zur Uni-Startseite
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus.html anchor: 
Intranet
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor: 
Redaktion
   outlink: toUrl: http://www.uni-kassel.de/ anchor: Logo der 
Universität Kassel
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus.html anchor: 
Aktuelles
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/themen/ueberblick.html anchor: Themen
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/abteilungen/ueberblick.html anchor: 
Abteilungen
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/organisation/ueberblick.html anchor: 
Organisation
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/schnelleinstieg/ueberblick.html 
anchor: Schnelleinstieg
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus/umfrage.html 
anchor: Feedback
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor: 
Redaktion
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/sitemap.html anchor: Sitemap
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/impressum.html anchor: 
Impressum
   outlink: toUrl: http://www.uni-kassel.de/intranet/typo3/backend.php 
anchor: Typo3-Login
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus/umfrage.html 
anchor: Feedback
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor: 
Redaktion
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/sitemap.html anchor: Sitemap
   outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/impressum.html anchor: 
Impressum
Content Metadata: Content-Length=3365 Expires=Thu, 19 Nov 1981 08:52:00 
GMT Set-Cookie=fe_typo_user=463afb6f8c8a68f74d4cfaefce7390c1; 
path=/intranet/ Connection=close Server=Apache/2.2.9 (Debian) 
mod_ssl/2.2.9 OpenSSL/0.9.8g X-Powered-By=PHP/5.2.6-1+lenny9 
Cache-Control=no-store, no-cache, must-revalidate, post-check=0, 
pre-check=0 Pragma=no-cache Date=Wed, 19 Oct 2011 11:58:48 GMT 
Vary=Accept-Encoding Content-Encoding=gzip Via=1.0 www.uni-kassel.de 
(Apache/2.2.9) Content-Type=text/html;charset=utf-8
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

As you can see, the second and third outlinks point to encrypted strings 
such as "nbjmup+jousbofuAvoj.lbttfm/ef".
In the source code of 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html I searched 
for the string "nbjmup+jousbofuAvoj.lbttfm/ef".

The result:
<a 
href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');" 
class="mail">

So it seems to me that the function argument is taken as a link. It is 
important to avoid this behaviour, since there are many sites in our 
network which use this JavaScript function.
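
To illustrate the suspicion, here is a hypothetical sketch of how a loose heuristic could turn the function argument into a link candidate, and how a stricter check would reject it. This is not the code of any Nutch plugin; the two patterns and all names are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsLinkGuess {
    // Loose heuristic (assumed): any single-quoted literal containing '/'.
    private static final Pattern QUOTED = Pattern.compile("'([^']*/[^']*)'");
    // Stricter check (assumed): absolute http(s) URL or root-relative path.
    private static final Pattern URL_LIKE = Pattern.compile("^(https?://|/)\\S+$");

    // Collects every quoted literal that the loose heuristic accepts.
    static List<String> naiveCandidates(String js) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED.matcher(js);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Keeps only the loose candidates that also look like a URL.
    static List<String> strictCandidates(String js) {
        List<String> out = new ArrayList<>();
        for (String candidate : naiveCandidates(js)) {
            if (URL_LIKE.matcher(candidate).matches()) {
                out.add(candidate);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String mailHref = "javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');";
        String realLink = "window.open('/intranet/footernavi/sitemap.html')";

        System.out.println(naiveCandidates(mailHref));  // the encrypted argument is picked up
        System.out.println(strictCandidates(mailHref)); // nothing survives the stricter check
        System.out.println(strictCandidates(realLink)); // a root-relative path is kept
    }
}
```

The stricter pattern trades recall for precision: it would also drop legitimate relative links such as 'docs/page.html', so it is only one possible compromise.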

How can I fix that? Is it happening in the parser?

Thank you very much



On 17.10.2011 17:13, Marek Bachmann wrote:
> Oh yeah, these tools are great, had forgot them :)
>
> On 17.10.2011 17:05, Markus Jelsma wrote:
>> I believe these links are extracted at some point but discarded in default
>> filters. You can test with parsechecker and see what you got.
>>
>>
>> On Monday 17 October 2011 15:47:22 Marek Bachmann wrote:
>>> Hello List,
>>>
>>> perhaps someone can give me a quick answer to this question:
>>>
>>> Some pages have "<a>" tag with javaScript content e.g.:
>>>
>>> <a
>>> href="javascript:linkTo_UnCryptMailto('nbjmup+ijxj.tuvecfsAvoj.lbttfm/ef');"
>>> class="mail">
>>>
>>> How will this href be handled?
>>>
>>> Thank you
>>
>

Re: How does nutch handles javaScript in href

Posted by Marek Bachmann <m....@uni-kassel.de>.

Oh yeah, these tools are great, I had forgotten about them :)

On 17.10.2011 17:05, Markus Jelsma wrote:
> I believe these links are extracted at some point but discarded in default
> filters. You can test with parsechecker and see what you got.
>
>
> On Monday 17 October 2011 15:47:22 Marek Bachmann wrote:
>> Hello List,
>>
>> perhaps someone can give me a quick answer to this question:
>>
>> Some pages have "<a>" tag with javaScript content e.g.:
>>
>> <a
>> href="javascript:linkTo_UnCryptMailto('nbjmup+ijxj.tuvecfsAvoj.lbttfm/ef');"
>> class="mail">
>>
>> How will this href be handled?
>>
>> Thank you
>

Re: How does nutch handles javaScript in href

Posted by Markus Jelsma <ma...@openindex.io>.

I believe these links are extracted at some point but discarded by the 
default filters. You can test with parsechecker and see what you get.


On Monday 17 October 2011 15:47:22 Marek Bachmann wrote:
> Hello List,
> 
> perhaps someone can give me a quick answer to this question:
> 
> Some pages have "<a>" tag with javaScript content e.g.:
> 
> <a
> href="javascript:linkTo_UnCryptMailto('nbjmup+ijxj.tuvecfsAvoj.lbttfm/ef');"
> class="mail">
> 
> How will this href be handled?
> 
> Thank you

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350