Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/18 18:37:05 UTC

[jira] [Created] (NUTCH-984) Parse-tika throws some URL's away

Parse-tika throws some URL's away
---------------------------------

                 Key: NUTCH-984
                 URL: https://issues.apache.org/jira/browse/NUTCH-984
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.3, 2.0
            Reporter: Markus Jelsma
            Priority: Critical
             Fix For: 1.3, 2.0


For some reason, when using parse-tika, a crawl just wouldn't dive into a website's news archive. Paging through the news archive is done with simple anchors:

<div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>

I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/3/
...

Now, this is rather funky. The code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same for parse-html and parse-tika. I also tested both parsers in versions 1.2 and 1.3 against the following URL:

http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm

 1.2 - parse-tika: 196
 1.2 - parse-html: 296
 1.3 - parse-tika: 279
 1.3 - parse-html: 296

Something clearly improved in 1.3, but not generating the remaining URLs is a blocker for parse-tika in my case.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-984) Parse-tika throws some URL's away

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-984:
--------------------------------

    Description: 
For some reason, when using parse-tika, a crawl just wouldn't dive into a website's news archive. Paging through the news archive is done with simple anchors:

<div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>

I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO  tika.DOMContentUtils - Throw away link:  http://www.site.nl/nieuws/overzicht/3/
...

Now, this is rather funky. The code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same for parse-html and parse-tika. I also tested both parsers in versions 1.2 and 1.3 against the following URL:

http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm

 1.2 - parse-tika: 196
 1.2 - parse-html: 296
 1.3 - parse-tika: 279
 1.3 - parse-html: 296

Something clearly improved in 1.3, but not generating the remaining URLs is a blocker for parse-tika in my case. Relevant configurations are the same; parser.html.outlinks.ignore_tags is not being used. Testing has been done with ParserChecker only.



[jira] [Issue Comment Edited] (NUTCH-984) Parse-tika throws some URL's away

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021113#comment-13021113 ] 

Markus Jelsma edited comment on NUTCH-984 at 4/26/11 4:02 PM:
--------------------------------------------------------------

Yes, I can test these URLs with tika-parsers 0.9, but what do you want to see? They seem to be parsed correctly when using the -t option but not when using -h or -x. The anchors become:

<a shape="rect" href="http://www.site.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3

So in this case the anchor indeed doesn't contain data and is thus thrown away. Might be a Tika issue instead!
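
To make the difference concrete, here is a minimal sketch that compares the anchor as written in the page with the anchor as it comes out above. It is not the Nutch or Tika code: the class AnchorContentSketch and its methods are hypothetical, the fragments are wrapped in a made-up <root> element so the JDK's XML parser accepts them, and "has child nodes" is only an assumed stand-in for whatever shouldThrowAwayLink really checks.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration only; none of these names exist in Nutch or Tika.
final class AnchorContentSketch {

    // Anchor as written in the page source (wrapped in <root> to be well-formed XML).
    static final String AS_WRITTEN =
        "<root><a href=\"/nieuws/overzicht/1/\"><div class=\"page\">2</div> </a></root>";

    // Anchor as reported in this comment: self-closed, with the text pushed outside.
    static final String AS_EMITTED =
        "<root><a shape=\"rect\" "
        + "href=\"http://www.site.nl/nieuws/het-laatste-nieuws/overzicht/2/\"/>3</root>";

    // Parses the fragment and returns its first <a> element.
    static Element firstAnchor(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("a").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("fragment could not be parsed", e);
        }
    }

    // Number of direct child nodes (elements and text) inside the anchor.
    static int anchorChildCount(String xml) {
        return firstAnchor(xml).getChildNodes().getLength();
    }

    // Text found inside the anchor, which is what a link's anchor text would be built from.
    static String anchorText(String xml) {
        return firstAnchor(xml).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("as written: children=" + anchorChildCount(AS_WRITTEN)
            + " text='" + anchorText(AS_WRITTEN).trim() + "'");
        System.out.println("as emitted: children=" + anchorChildCount(AS_EMITTED)
            + " text='" + anchorText(AS_EMITTED).trim() + "'");
    }
}
```

In the emitted form the page number is a sibling of the anchor rather than a child, so any check that looks only inside the anchor sees nothing to keep.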

      was (Author: markus17):
    Yes, I can test these URLs with tika-parsers 0.9, but what do you want to see? They seem to be parsed correctly when using the -t option but not when using -h or -x. The anchors become:

<a shape="rect" href="http://www.arriva.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3

So in this case the anchor indeed doesn't contain data and is thus thrown away. Might be a Tika issue instead!
  

[jira] [Commented] (NUTCH-984) Parse-tika throws some URL's away

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021113#comment-13021113 ] 

Markus Jelsma commented on NUTCH-984:
-------------------------------------

Yes, I can test these URLs with tika-parsers 0.9, but what do you want to see? They seem to be parsed correctly when using the -t option but not when using -h or -x. The anchors become:

<a shape="rect" href="http://www.arriva.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3

So in this case the anchor indeed doesn't contain data and is thus thrown away. Might be a Tika issue instead!


[jira] [Commented] (NUTCH-984) Parse-tika throws some URL's away

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021099#comment-13021099 ] 

Julien Nioche commented on NUTCH-984:
-------------------------------------

Could you test the URLs above directly with Tika 0.9? I suppose this has to do with the default mappers used by Tika which we can override from Nutch.

BTW, this illustrates why parse-html is still the default option for HTML and parse-tika is used for the other MIME types. I'd suggest that we mark this as fixed in 2.0, as 1.3 is about to be RCed. More generally, the tests that are used for checking the HTML parsing need to be ported to parse-tika as well.





[jira] [Resolved] (NUTCH-984) Parse-tika throws some URL's away

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-984.
-------------------------------------

    Resolution: Won't Fix

Looks like this is a Tika issue. If not, please let someone know or file a new issue.

Thanks!


[jira] [Closed] (NUTCH-984) Parse-tika throws some URL's away

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-984.
-------------------------------


Bulk close of resolved issues for 1.3.
