Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/18 18:37:05 UTC
[jira] [Updated] (NUTCH-984) Parse-tika throws some URL's away

     [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-984:
--------------------------------

    Description: 
For some reason, a crawl using parse-tika just wouldn't descend into a website's news archive. Paging through the news archive is done with simple anchors:

<div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>

I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/3/
...
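
To see what child nodes each of these anchors carries, the fragment can be loaded into a DOM and dumped. The following is only a minimal sketch: it uses the JDK's XML parser on the fragment wrapped in a made-up root element, not the HTML parser behind parse-tika, so the DOM that parse-tika actually builds may look different. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AnchorChildDump {

    /** Returns one line per <a> element: its href and the types of its direct children. */
    static List<String> describeAnchors(String htmlFragment) {
        // Assumption: the fragment is well-formed XML once wrapped in a single root element.
        String xml = "<root>" + htmlFragment + "</root>";
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Fragment could not be parsed as XML", e);
        }

        List<String> lines = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            NodeList children = a.getChildNodes();
            StringBuilder sb = new StringBuilder();
            sb.append(a.getAttribute("href"))
              .append(" childLen=").append(children.getLength()).append(":");
            for (int j = 0; j < children.getLength(); j++) {
                Node c = children.item(j);
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    sb.append(" ELEMENT(").append(c.getNodeName()).append(")");
                } else if (c.getNodeType() == Node.TEXT_NODE) {
                    // Distinguish whitespace-only text nodes from text with content.
                    sb.append(c.getNodeValue().trim().isEmpty() ? " TEXT(whitespace)" : " TEXT");
                } else {
                    sb.append(" OTHER");
                }
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // The paging fragment quoted in this report.
        String fragment = "<div class=\"page active\">1</div> "
            + "<a href=\"/nieuws/overzicht/1/\"><div class=\"page\">2</div> </a> "
            + "<a href=\"/nieuws/overzicht/2/\"><div class=\"page\">3</div> </a>";
        for (String line : describeAnchors(fragment)) {
            System.out.println(line);
        }
    }
}
```

With the JDK parser, each anchor here has two children: the div element and a whitespace-only text node.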

Now, this is rather odd: the code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same in parse-html and parse-tika. I also compared the two parsers in versions 1.2 and 1.3 on the following URL:

http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm

 1.2 - parse-tika: 196
 1.2 - parse-html: 296
 1.3 - parse-tika: 279
 1.3 - parse-html: 296

Something clearly improved in 1.3, but the URLs that are still missing are a blocker for parse-tika in my case. The relevant configuration is the same for both parsers; parser.html.outlinks.ignore_tags is not being used. Testing has been done with ParserChecker only.
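
One way to pin down which URLs parse-tika still drops is to diff the outlink lists produced by the two parsers. This is a minimal sketch, assuming the outlink URLs from each ParserChecker run have already been collected into lists; how they are extracted from the output is not shown. The class name, method name, and sample URLs are made up for illustration and are not real results.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class OutlinkDiff {

    /** Returns the URLs present in reference but absent from candidate, in sorted order. */
    static Set<String> missingFrom(Collection<String> reference, Collection<String> candidate) {
        Set<String> missing = new TreeSet<>(reference);
        missing.removeAll(candidate);
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not actual ParserChecker output.
        List<String> parseHtmlOutlinks = Arrays.asList(
            "http://example.org/a", "http://example.org/b", "http://example.org/c");
        List<String> parseTikaOutlinks = Arrays.asList(
            "http://example.org/a", "http://example.org/c");

        // Lists every URL that parse-html emitted but parse-tika did not.
        for (String url : missingFrom(parseHtmlOutlinks, parseTikaOutlinks)) {
            System.out.println("missing in parse-tika: " + url);
        }
    }
}
```

Checking the markup around each missing URL in the page source would then show whether the dropped links share a common structure.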

  was:
For some reason, a crawl using parse-tika just wouldn't descend into a website's news archive. Paging through the news archive is done with simple anchors:

<div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>

I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/3/
...

Now, this is rather odd: the code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same in parse-html and parse-tika. I also compared the two parsers in versions 1.2 and 1.3 on the following URL:

http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm

 1.2 - parse-tika: 196
 1.2 - parse-html: 296
 1.3 - parse-tika: 279
 1.3 - parse-html: 296

Something clearly improved in 1.3, but the URLs that are still missing are a blocker for parse-tika in my case.


> Parse-tika throws some URL's away
> ---------------------------------
>
>                 Key: NUTCH-984
>                 URL: https://issues.apache.org/jira/browse/NUTCH-984
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.3, 2.0
