Posted to dev@nutch.apache.org by "senthil kumar (JIRA)" <ji...@apache.org> on 2014/07/23 07:00:41 UTC
[jira] [Commented] (NUTCH-1822) Page outlinks clearance is not appropriate
[ https://issues.apache.org/jira/browse/NUTCH-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071370#comment-14071370 ]
senthil kumar commented on NUTCH-1822:
--------------------------------------
http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
> Page outlinks clearance is not appropriate
> -------------------------------------------
>
> Key: NUTCH-1822
> URL: https://issues.apache.org/jira/browse/NUTCH-1822
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.1
> Environment: Nutch-2.1
> Hadoop-0.20.205
> HBase-0.90.6
> hbase-gora-0.2.1
> Reporter: Riyaz Shaik
>
> 1. When a page is re-crawled and the parse finds new outlink URLs alongside existing ones, the old outlinks are removed and only the new URLs are written to HBase.
> Ex:
> Crawl cycle 1 for www.123.com, identified outlinks are
> ol --> abc.com
> ol --> pqr.com
> Crawl cycle 2 for the same www.123.com, identified outlinks are
> (note that abc.com has been removed and xyz.com added)
> ol --> pqr.com
> ol --> xyz.com
> At the end of crawl cycle 2, HBase has only xyz.com as an outlink
> ol --> xyz.com
> Expected:
> ol --> pqr.com
> ol --> xyz.com
> 2. If some outlinks are removed from a page and no new outlinks are added, re-crawling the page does not clear the obsolete outlinks from HBase.
> Ex: Cycle 1 crawled page www.test.com, identified outlinks are
> ol --> link1
> ol --> link2
> ol --> link3
> Cycle 2, same page (www.test.com) re-crawled, identified outlinks are
> (Note: only link2 was removed; no new links were added)
> ol --> link1
> ol --> link3
> But at the end of cycle 2, all 3 outlinks are still in HBase
> In HBase:
> ol --> link1
> ol --> link2
> ol --> link3
> Expected:
> ol --> link1
> ol --> link3
> Per the code in ParseUtil.java, it seems to remove the old links and insert only the new links.
> if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
> Thanks
> Riyaz
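For illustration, the expected behavior in both cases above amounts to making the stored outlinks match exactly what the latest parse found. The following is only a minimal in-memory sketch of that rule, not the Nutch code: the class name OutlinkSync and the method sync are hypothetical, and plain String maps (URL to anchor text) stand in for the page's outlink map. Whether removed entries are actually deleted from HBase also depends on how the storage layer persists map changes, which this sketch does not cover.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: shows the expected end state only.
public class OutlinkSync {

    /** Make `stored` hold exactly the outlinks found by the latest parse. */
    static void sync(Map<String, String> stored, Map<String, String> parsed) {
        // Case 2: drop links that no longer appear on the page.
        stored.keySet().retainAll(parsed.keySet());
        // Case 1: add new links and keep (or refresh) the surviving ones.
        stored.putAll(parsed);
    }

    public static void main(String[] args) {
        // Case 1 from the report: abc.com removed, xyz.com added.
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("abc.com", "");
        stored.put("pqr.com", "");
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("pqr.com", "");
        parsed.put("xyz.com", "");
        sync(stored, parsed);
        System.out.println(stored.keySet()); // [pqr.com, xyz.com]

        // Case 2 from the report: link2 removed, nothing added.
        Map<String, String> stored2 = new LinkedHashMap<>();
        stored2.put("link1", "");
        stored2.put("link2", "");
        stored2.put("link3", "");
        Map<String, String> parsed2 = new LinkedHashMap<>();
        parsed2.put("link1", "");
        parsed2.put("link3", "");
        sync(stored2, parsed2);
        System.out.println(stored2.keySet()); // [link1, link3]
    }
}
```

The sketch removes stale keys individually instead of clearing the whole map first, so each removal and each addition is a separate, visible change to the map.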