Posted to dev@nutch.apache.org by "senthil kumar (JIRA)" <ji...@apache.org> on 2014/07/23 07:00:41 UTC

[jira] [Commented] (NUTCH-1822) Page outlinks clearance is not appropriate

    [ https://issues.apache.org/jira/browse/NUTCH-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071370#comment-14071370 ] 

senthil kumar commented on NUTCH-1822:
--------------------------------------

http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html

> Page outlinks  clearance is not appropriate
> -------------------------------------------
>
>                 Key: NUTCH-1822
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1822
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.1
>         Environment: Nutch-2.1
> Hadoop-0.20.205
> HBase-0.90.6
> hbase-gora-0.2.1
>            Reporter: Riyaz Shaik
>
> 1. When a page is re-crawled and the parser finds new outlink URLs alongside the existing ones, the old outlinks are removed and only the new URLs are written to HBase.
> Ex:
> Crawl cycle 1 for www.123.com, identified outlinks are
> ol --> abc.com
> ol --> pqr.com
> Crawl cycle 2 of the same www.123.com, the outlinks are
> (note that abc.com was removed and xyz.com was added)
> ol --> pqr.com
> ol --> xyz.com
> At the end of crawl cycle 2, HBase has only xyz.com as an outlink
> ol --> xyz.com
> Expected:
> ol --> pqr.com 
> ol --> xyz.com 
> 2. If some outlinks are removed from a page and no new outlinks are added, re-crawling the page does not clear the obsolete outlinks from HBase.
> Ex: Cycle 1 crawled page www.test.com, identified outlinks are
> ol --> link1
> ol --> link2
> ol --> link3
> Cycle 2, the same page (www.test.com) re-crawled, identified outlinks are
> (note: only link2 was removed; no new links were added)
> ol --> link1
> ol --> link3
> But at the end of cycle 2, HBase still has all 3 outlinks
> In HBase:
> ol --> link1
> ol --> link2
> ol --> link3
> Expected:
> ol --> link1
> ol --> link3
> As per the code in ParseUtil.java, it seems to remove the old links and insert only the new links.
> if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
> Thanks
> Riyaz
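
For reference, the expected behaviour in both cases above can be sketched in plain Java. This is only an illustration of the reconciliation the report asks for, not Nutch or Gora code: the class OutlinkReconciler and its methods are hypothetical, and outlinks are modelled as a simple map from URL to anchor text. The idea is that after a re-crawl the stored outlinks should equal exactly what the latest parse found, so links that disappeared are dropped and links that are still present are kept.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch: models outlinks as URL -> anchor text.
public class OutlinkReconciler {

    // Builds an outlink map with empty anchor text, preserving insertion order.
    static Map<String, String> links(String... urls) {
        Map<String, String> m = new LinkedHashMap<>();
        for (String url : urls) {
            m.put(url, "");
        }
        return m;
    }

    // Outlinks stored by the previous cycle that the latest parse no longer found.
    static Set<String> obsolete(Map<String, String> stored, Map<String, String> parsed) {
        Set<String> gone = new HashSet<>(stored.keySet());
        gone.removeAll(parsed.keySet());
        return gone;
    }

    // The outlinks that should be stored after the re-crawl.
    static Map<String, String> reconcile(Map<String, String> stored, Map<String, String> parsed) {
        Map<String, String> result = new LinkedHashMap<>(stored);
        result.keySet().removeAll(obsolete(stored, parsed)); // drop links removed from the page
        result.putAll(parsed);                               // add new links, keep surviving ones
        return result;
    }

    public static void main(String[] args) {
        // Case 1: abc.com removed, xyz.com added, pqr.com unchanged.
        System.out.println(reconcile(links("abc.com", "pqr.com"),
                                     links("pqr.com", "xyz.com")).keySet());
        // prints [pqr.com, xyz.com]

        // Case 2: link2 removed, nothing added.
        System.out.println(reconcile(links("link1", "link2", "link3"),
                                     links("link1", "link3")).keySet());
        // prints [link1, link3]
    }
}
```

In a real store the set returned by obsolete() would presumably have to be turned into explicit deletes, since simply not writing a column does not remove it; whether that is the cause of case 2 in Nutch 2.1 is an assumption, not something confirmed here.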



--
This message was sent by Atlassian JIRA
(v6.2#6252)