Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/04/18 18:52:05 UTC

[jira] [Commented] (NUTCH-984) Parse-tika throws some URL's away

    [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021099#comment-13021099 ] 

Julien Nioche commented on NUTCH-984:
-------------------------------------

Could you test the URLs above directly with Tika 0.9? I suspect this is caused by the default mappers used by Tika, which we can override from Nutch.

BTW this illustrates why parse-html is still the default option for HTML and parse-tika is used for the other MIME types. I'd suggest that we mark this as fixed in 2.0, as 1.3 is about to get a release candidate. More generally, the tests that are used for checking the HTML parsing need to be ported to parse-tika as well.
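
As a parser-independent baseline for such a test, here is a rough Python sketch (standard library only, not Nutch or Tika code) of the outlinks that the pagination markup quoted in the issue description should yield. The names AnchorCollector and extract_outlinks are made up for illustration, and the base URL http://www.site.nl/ is an assumption taken from the host in the logged links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> start tag, whatever the anchor contains."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so "a" also matches <A>.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_outlinks(html, base_url):
    """Returns the absolute URLs of all anchors in the given HTML fragment."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Pagination markup copied from the issue description.
SNIPPET = (
    '<div class="page active">1</div> '
    '<a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> '
    '<a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>'
)

# Outlinks a parser is expected to keep for SNIPPET (base URL is assumed).
EXPECTED = [
    "http://www.site.nl/nieuws/overzicht/1/",
    "http://www.site.nl/nieuws/overzicht/2/",
]
```

Calling extract_outlinks(SNIPPET, "http://www.site.nl/") and comparing the result with EXPECTED gives the kind of check a ported test could make; an anchor whose only child is a div is still expected to produce an outlink.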




> Parse-tika throws some URL's away
> ---------------------------------
>
>                 Key: NUTCH-984
>                 URL: https://issues.apache.org/jira/browse/NUTCH-984
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.3, 2.0
>
>
> For some reason, when using parse-tika, a crawl just wouldn't dive into a website's news archive. Paging through the news archive is done with simple anchors:
> <div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>
> I added some logging to DOMContentUtils:
> 2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/1/
> 2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/2/
> 2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/3/
> ...
> Now, this is rather funky. The code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same in parse-html and parse-tika. I also compared the two parsers in versions 1.2 and 1.3 on the following URL:
> http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm
>  1.2 - parse-tika: 196
>  1.2 - parse-html: 296
>  1.3 - parse-tika: 279
>  1.3 - parse-html: 296
> Something clearly improved in 1.3, but the remaining URLs that are not generated are a blocker for parse-tika in my case. The relevant configurations are the same, and parser.html.outlinks.ignore_tags is not being used. Testing has been done with ParserChecker only.
