Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/18 18:37:05 UTC
[jira] [Created] (NUTCH-984) Parse-tika throws some URL's away
Parse-tika throws some URL's away
---------------------------------
Key: NUTCH-984
URL: https://issues.apache.org/jira/browse/NUTCH-984
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3, 2.0
Reporter: Markus Jelsma
Priority: Critical
Fix For: 1.3, 2.0
For some reason, when using parse-tika, a crawl just wouldn't dive into a website's news archive. The paging through the news archive is done with simple anchors:
<div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>
I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/3/
...
Now, this is rather funky. The code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same for parse-html and parse-tika. I also compared the two parsers in versions 1.2 and 1.3 using the following URL:
http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm
1.2 - parse-tika: 196
1.2 - parse-html: 296
1.3 - parse-tika: 279
1.3 - parse-html: 296
Something clearly improved in 1.3, but the failure to generate the remaining URLs is a blocker for parse-tika in my case.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-984) Parse-tika throws some URL's away
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-984:
--------------------------------
Description:
For some reason, when using parse-tika, a crawl just wouldn't dive into a website's news archive. The paging through the news archive is done with simple anchors:
<div class="page active">1</div> <a href="/nieuws/overzicht/1/"><div class="page">2</div> </a> <a href="/nieuws/overzicht/2/"><div class="page">3</div> </a>
I added some logging to DOMContentUtils:
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/1/
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/2/
2011-04-18 18:26:09,788 INFO tika.DOMContentUtils - Throw away link: http://www.site.nl/nieuws/overzicht/3/
...
Now, this is rather funky. The code for private boolean shouldThrowAwayLink(Node node, NodeList children, int childLen, LinkParams params) is the same for parse-html and parse-tika. I also compared the two parsers in versions 1.2 and 1.3 using the following URL:
http://news.bbc.co.uk/2/hi/europe/country_profiles/1154019.stm
1.2 - parse-tika: 196
1.2 - parse-html: 296
1.3 - parse-tika: 279
1.3 - parse-html: 296
Something clearly improved in 1.3, but the failure to generate the remaining URLs is a blocker for parse-tika in my case. The relevant configurations are the same, and parser.html.outlinks.ignore_tags is not being used. Testing has been done with ParserChecker only.
[jira] [Issue Comment Edited] (NUTCH-984) Parse-tika throws some URL's away
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021113#comment-13021113 ]
Markus Jelsma edited comment on NUTCH-984 at 4/26/11 4:02 PM:
--------------------------------------------------------------
Yes, I can test these URLs with tika-parsers 0.9, but what do you want to see? They seem to be parsed correctly when using the -t option but not when using -h or -x. The anchors become:
<a shape="rect" href="http://www.site.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3
So in this case the anchor indeed doesn't contain data and is thus thrown away. Might be a Tika issue instead!
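The difference between the two anchor shapes can be checked with the standard Java DOM API. The following is only a minimal sketch, not Nutch's actual shouldThrowAwayLink logic: the class AnchorContentCheck and its helpers firstAnchor and hasNoContent are hypothetical names, and the <p> wrapper is added solely to make each fragment well-formed XML. It shows that the original anchor has child nodes, while the self-closed anchor in the -h/-x output has none because the "3" ends up outside the element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AnchorContentCheck {

    // Parses a small well-formed XML fragment and returns its first <a> element.
    static Element firstAnchor(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Element) doc.getElementsByTagName("a").item(0);
    }

    // Hypothetical simplification: an anchor without any child nodes
    // carries no anchor text or nested markup.
    static boolean hasNoContent(Element anchor) {
        return anchor.getChildNodes().getLength() == 0;
    }

    public static void main(String[] args) throws Exception {
        // Anchor as written on the site: the <div> is nested inside the <a>.
        String original =
            "<p><a href=\"/nieuws/overzicht/1/\"><div class=\"page\">2</div> </a></p>";
        // Anchor as reported in the comment: self-closed, text outside the element.
        String reported =
            "<p><a shape=\"rect\" href=\"http://www.site.nl/nieuws/het-laatste-nieuws/overzicht/2/\"/>3</p>";

        System.out.println("original has no content: " + hasNoContent(firstAnchor(original)));
        System.out.println("reported has no content: " + hasNoContent(firstAnchor(reported)));
    }
}
```

Any link filter that relies on the children of the anchor node would see the second form as empty, which is consistent with the "Throw away link" log lines in the issue description.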
[jira] [Commented] (NUTCH-984) Parse-tika throws some URL's away
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021113#comment-13021113 ]
Markus Jelsma commented on NUTCH-984:
-------------------------------------
Yes, I can test these URLs with tika-parsers 0.9, but what do you want to see? They seem to be parsed correctly when using the -t option but not when using -h or -x. The anchors become:
<a shape="rect" href="http://www.arriva.nl/nieuws/het-laatste-nieuws/overzicht/2/"/>3
So in this case the anchor indeed doesn't contain data and is thus thrown away. Might be a Tika issue instead!
[jira] [Commented] (NUTCH-984) Parse-tika throws some URL's away
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021099#comment-13021099 ]
Julien Nioche commented on NUTCH-984:
-------------------------------------
Could you test the URLs above directly with Tika 0.9? I suppose this has to do with the default mappers used by Tika, which we can override from Nutch.
BTW, this illustrates why parse-html is still the default option for HTML and parse-tika is used for the other MIME types. I'd suggest that we mark this as fixed in 2.0, as 1.3 is about to be RCed. More generally, the tests that are used for checking the HTML parsing need to be ported to parse-tika as well.
[jira] [Resolved] (NUTCH-984) Parse-tika throws some URL's away
Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann resolved NUTCH-984.
-------------------------------------
Resolution: Won't Fix
Looks like this is a Tika issue. If not, please let someone know or file a new issue.
Thanks!
[jira] [Closed] (NUTCH-984) Parse-tika throws some URL's away
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-984.
-------------------------------
Bulk close of resolved issues for 1.3.